A while back, my advisor asked me to look into the GreenPlum database. I installed it and tried it out, felt it was no different from any other database, and let the matter drop. Now I'm going back over GPDB's features and actually trying them out.
In truth, the features written on the official marketing page are exactly what I need to understand.
Features
First: what are GPDB's features, and where do I find them? The product's marketing page is the obvious place to look.
The Biggest Feature
GPDB's defining feature is MPP, i.e., Massively Parallel Processing. On the home page, the two most prominent words are:
- Massively Parallel
- Analytics
Scrolling down, two more prominent claims are:
- Power at scale: high performance on petabyte-scale data volumes.
- True flexibility: deploy anywhere.
Key Features
Further down, the home page spells out GPDB's key features:
- MPP Architecture
- Petabyte-Scale Loading: loading speed grows with each additional node, exceeding 10 TB per hour per rack (roughly 2.8 GB/s).
- Innovative Query Optimization: the industry's first cost-based query optimizer for big-data workloads.
- Polymorphic Data Storage: full control over the configuration of table and partition storage, execution, and compression (see the sketch after this list).
- Integrated In-Database Analytics: a library provided by Apache MADlib for scalable in-database analytics, extending Greenplum's SQL capabilities through user-defined functions.
- Federated Data Access: query external data sources, including Hadoop, cloud storage, ORC, AVRO, Parquet, and other polyglot data stores, using the Greenplum optimizer and query processing engine.
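To make polymorphic storage concrete, here is a minimal sketch using the append-optimized, column-oriented storage options from the GPDB docs (the table itself is my own illustration):

```sql
-- Hypothetical table: column-oriented, zlib-compressed, append-optimized storage,
-- distributed across segments by id.
CREATE TABLE sales_ao (
    id  bigint,
    amt numeric
)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (id);
```

Options like these can also be set per partition, which is what "polymorphic" refers to.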
Unfortunately, none of these icons on the home page are clickable, so to experience the features I have to explore on my own.
TODO
- Query data from Hadoop.
GPDB vs. Hadoop + Hive
Source: Greenplum介绍, a report Alibaba gave on 2011-02-17, against GPDB 4.x.
Compared with Hadoop + Hive, GPDB's query performance is better, but it supports far fewer cluster nodes: at most about 1000 segments, whereas Hive can scale to tens of thousands of nodes.
Data Loading
There are three main loading methods:
- the SQL INSERT statement: inefficient for large volumes; suited to small datasets.
- the COPY command: lets you specify the format of the text file so it can be parsed into columns and rows. Faster than INSERT, but not a parallel process (a short sketch follows this list).
- gpfdist and gpload: efficiently move external data into tables with fast, parallel loading. An administrator can enable single row error isolation mode so that well-formed rows keep loading. gpload requires a YAML-formatted control file written in advance, describing the source data location, format, required transformations, participating hosts, database destination, and so on; this allows complex loading jobs.
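As promised, a minimal COPY sketch (the table and file path are made up for illustration):

```sql
-- Parse a pipe-delimited text file into rows and columns.
-- COPY runs through the master, so it is not parallel.
COPY expenses FROM '/data/expenses.txt'
WITH (FORMAT text, DELIMITER '|', NULL ' ');
```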
Below, only gpfdist and gpload usage is demonstrated:
gpfdist
```
gpfdist [-d directory] [-p http_port] [-l log_file] [-t timeout]
```
Example:
This uses the gpfdist protocol to create a readable external table, ext_expenses, from all files with the txt extension. The column delimiter is a pipe (|) and NULL (' ') is a space. Access to the external table is in single row error isolation mode. If the error count on a segment is greater than five (the SEGMENT REJECT LIMIT value), the entire external table operation fails and no rows are processed.
```sql
=# CREATE EXTERNAL TABLE ext_expenses ( name text,
     date date, amount float4, category text, desc1 text )
   LOCATION ('gpfdist://etlhost-1:8081/*.txt')  -- host/port illustrative
   FORMAT 'TEXT' (DELIMITER '|' NULL ' ')
   LOG ERRORS SEGMENT REJECT LIMIT 5;
```
To create the readable ext_expenses table from CSV-formatted text files:
```sql
=# CREATE EXTERNAL TABLE ext_expenses ( name text,
     date date, amount float4, category text, desc1 text )
   LOCATION ('gpfdist://etlhost-1:8081/*.txt')  -- host/port illustrative
   FORMAT 'CSV' ( DELIMITER ',' );
```
Below is my own experiment.
Create a data directory ~/Datasets/baike and move the dataset baike_triples.txt into it. The dataset comes from CN-DBpedia and contains 9M+ encyclopedia entities and 67M+ triples, including 1.1M+ mention2entity records, 4M+ abstracts, 19.8M+ tags, and 41M+ infobox entries, about 4.0 GB in total. Sample rows:
```
"1+8"时代广场 中文名 "1+8"时代广场
"1+8"时代广场 地点 咸宁大道与银泉大道交叉口
"1+8"时代广场 实质 城市综合体项目
"1+8"时代广场 总建面 约11.28万方
"1.4"河南兰考火灾事故 中文名 "1.4"河南兰考火灾事故
"1.4"河南兰考火灾事故 地点 河南<a>兰考县</a>城关镇
"1.4"河南兰考火灾事故 时间 2013年1月4日
```
Start gpfdist in the background:

```bash
gpfdist -d ~/Datasets/baike -p 8081 > /tmp/gpfdist.log 2>&1 &
ps -A | grep gpfdist             # find the process ID
# 30693 pts/8 00:00:00 gpfdist   -> the PID is 30693
```

Option notes:
- -d directory: the directory from which gpfdist serves files for readable external tables, or in which it creates output files for writable external tables. Defaults to the current directory.
- -p http_port: the HTTP port on which gpfdist serves files. Defaults to 8080.
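Incidentally, the same gpfdist instance can also receive data unloaded from the database through a writable external table; a minimal sketch (the table and output file names are made up):

```sql
CREATE WRITABLE EXTERNAL TABLE ext_baike_out (head text, rel text, tail text)
    LOCATION ('gpfdist://vm1:8081/baike_out.txt')
    FORMAT 'TEXT' (DELIMITER E'\t');
-- INSERT INTO ext_baike_out SELECT head, rel, tail FROM tb_baike;
```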
View the log:
```
gt@vm1:~$ more /tmp/gpfdist.log
2020-01-13 02:49:38 30693 INFO Before opening listening sockets - following listening sockets are available:
2020-01-13 02:49:38 30693 INFO IPV6 socket: [::]:8081
2020-01-13 02:49:38 30693 INFO IPV4 socket: 0.0.0.0:8081
2020-01-13 02:49:38 30693 INFO Trying to open listening socket:
2020-01-13 02:49:38 30693 INFO IPV6 socket: [::]:8081
2020-01-13 02:49:38 30693 INFO Opening listening socket succeeded
2020-01-13 02:49:38 30693 INFO Trying to open listening socket:
2020-01-13 02:49:38 30693 INFO IPV4 socket: 0.0.0.0:8081
Serving HTTP on port 8081, directory /home/gt/Datasets/baike
```
As gpadmin, open a psql session and create the tables: ext_baike, an external table for reading the loaded data, with LOG ERRORS capturing the rows that fail to parse (in GPDB 6 these go to an internal error log rather than a separate error table such as ext_load_baike_err), and tb_baike for internal storage.

```sql
psql -h localhost -d db_kg  # connect to database db_kg
# create the external table
CREATE EXTERNAL TABLE ext_baike (
    head text, rel text, tail text)
LOCATION ('gpfdist://vm1:8081/baike_triples.txt')
FORMAT 'TEXT' (DELIMITER E'\t')
LOG ERRORS SEGMENT REJECT LIMIT 50000;
# create the internal storage table
CREATE TABLE tb_baike (
    id SERIAL PRIMARY KEY, head text, rel text, tail text);
```

For the full external-table syntax, see CREATE EXTERNAL TABLE.
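With single row error isolation enabled, any rejected rows can be inspected after a load; a quick check using GPDB's built-in error log (my addition, not from the original session):

```sql
-- count the malformed rows captured for this external table
SELECT count(*) FROM gp_read_error_log('ext_baike');
-- peek at a few of them
SELECT cmdtime, errmsg, rawdata
FROM gp_read_error_log('ext_baike')
LIMIT 5;
```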
Once the external table exists, you can read from it directly, for example:
```
db_kg=# select * from ext_baike limit 10;
      head       |   rel    |     tail
-----------------+----------+---------------
 *裔*            | 中文名   | *裔*
 *裔*            | 作者     | Amarantine
 *裔*            | 小说进度 | 暂停
 *裔*            | 连载网站 | 晋江文学城
 *西方犯罪学概论 | BaiduTAG | 书籍
 *西方犯罪学概论 | ISBN     | 9787811399967
 *西方犯罪学概论 | 作者     | 李明琪 编
 *西方犯罪学概论 | 出版时间 | 2010-4
 *西方犯罪学概论 | 定价     | 22.00元
 *西方犯罪学概论 | 页数     | 305
(10 rows)
```
Load the external table's data into the internal table:

```sql
INSERT INTO tb_baike(head, rel, tail) SELECT * FROM ext_baike;
```
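If the load completes, a quick row-count comparison verifies it (a trivial check I'd run, not from the original session):

```sql
SELECT (SELECT count(*) FROM ext_baike) AS source_rows,
       (SELECT count(*) FROM tb_baike)  AS loaded_rows;
```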
In my case the VM ran out of storage, so the run ultimately failed (it started at 15:33:30 and ended abnormally around 16:03). The log shows:

```
more greenplum/data/data1/primary/gpseg0/pg_log/gpdb-2020-01-13_000000.csv
2020-01-13 04:10:14.623480 UTC,,,p24832,th930718592,,,,0,,,seg0,,,,,"PANIC","53100","could not write to file ""pg_xlog/xlogtemp.24832"": No space left on device",,,,,,,0,,"xlog.c"
```
Other errors encountered:

```
db_kg=# INSERT INTO tb_baike(head, rel, tail) SELECT * FROM ext_baike;
ERROR:  gpfdist error: unknown meta type 108 (url_curl.c:1635)  (seg0 slice1 127.0.1.1:6000 pid=15880) (url_curl.c:1635)
CONTEXT:  External table ext_baike, file gpfdist://vm1:8081/baike_triples.txt
```

A second attempt with LIMIT 10000 appended (started 16:10:00, ended 17:12:42) failed with:

```
ERROR:  interconnect encountered a network error, please check your network  (seg3 slice1 192.168.56.6:6001 pid=11612)
DETAIL:  Failed to send packet (seq 1) to 127.0.1.1:56414 (pid 15913 cid -1) after 3562 retries in 3600 seconds
```
Cost-based Query Optimizer
When the master receives a SQL statement, it parses it into an execution-plan DAG. Parts of the DAG that need no data exchange are grouped into slices; joins, aggregates, and sorts all require redistributing data between slices, and a motion operator performs that redistribution. The slices are then dispatched to the segments involved. Source: GreenPlum:基于PostgreSQL的分布式关系型数据库.
See also About Greenplum Query Processing:
slice: To achieve maximum parallelism during query execution, Greenplum divides the work of the query plan into slices. A slice is a portion of the plan that segments can work on independently. A query plan is sliced wherever a motion operation occurs in the plan, with one slice on each side of the motion.
motion: A motion operation involves moving tuples between the segments during query processing. Note that not every query requires a motion; for example, a targeted query plan does not require data to move across the interconnect.
- A redistribute motion moves tuples between the segments to complete a join.
- A gather motion is when the segments send results back up to the master for presentation to the client. Because a query plan is always sliced wherever a motion occurs, such a plan also has an implicit slice at the very top of the plan (slice 3).
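EXPLAIN makes the slices and motions visible; a sketch against the tables above (the plan shape is indicative, not a captured run):

```sql
EXPLAIN SELECT rel, count(*) FROM tb_baike GROUP BY rel;
-- Expect nodes along the lines of:
--   Gather Motion 4:1 (slice2; segments: 4)            -- segments send results to the master
--     HashAggregate
--       Redistribute Motion 4:4 (slice1; segments: 4)  -- re-hash rows by rel
-- each motion marks a slice boundary.
```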
Federated Data Access
MADlib Extension
Adding the madlib extension
Download MADlib 1.16+8 for RHEL 7 from Pivotal Network.
Install it with gppkg. An error occurred:
```
gt@vm1:~/madlib-1.16+8-gp6-rhel7-x86_64$ gppkg -i madlib-1.16+8-gp6-rhel7-x86_64.gppkg
```
In the end, I reinstalled the OS with CentOS 7. Pivotal Network states which systems madlib can be installed on: it only provides binary packages for Redhat 6.x and Redhat 7.x, so the Ubuntu 18.04 I had installed earlier was no use. Redhat Enterprise Linux is beyond my budget, so the free options were Fedora or CentOS; I went with CentOS 7:
```
[gpadmin@vm1 ~]$ cat /etc/os-release
```
The installed Greenplum is version 6.3, so the file to download is madlib-1.16+9-gp6-rhel7-x86_64.tar.gz. After extraction:
```
[gpadmin@vm1 madlib-1.16+9-gp6-rhel7-x86_64]$ ls -la
```
The madlib-1.16+9-gp6-rhel7-x86_64.gppkg file inside must be installed with GreenPlum's gppkg tool. Its usage is:
```
$GPHOME/bin/gppkg [-i <package> | -u <package> | -r <name-version> | -c]
```
Install:

```
$GPHOME/bin/gppkg -i madlib-1.16+9-gp6-rhel7-x86_64.gppkg
```
Then install the MADlib objects into a database:

```
$GPHOME/madlib/bin/madpack install -s madlib -p greenplum -c gpadmin@vm1:5432/testdb
```
The install log follows (m4 must be installed first: yum install m4):

```
[gpadmin@vm1 madlib-1.16+9-gp6-rhel7-x86_64]$ $GPHOME/madlib/bin/madpack install -p greenplum
```
Using madlib
An error (in database db_kg):
ERROR: schema "madlib" does not exist
Because no database was specified when installing madlib with madpack, using madlib in db_kg failed. Check with madpack install-check:

```
[gpadmin@vm1 ~]$ $GPHOME/madlib/bin/madpack install-check -p greenplum -c gpadmin@vm1:5432/db_madlib_demo
```
This confirmed that madlib was not installed in database db_kg. Looking back at the madpack install log, it had defaulted to the gpadmin database, i.e., the database named after the user:

```
madpack.py: INFO : MADlib database version = None (host=localhost:5432, db=gpadmin, schema=madlib)
```
Testing against that default database showed madlib was indeed installed there:

```
[gpadmin@vm1 ~]$ $GPHOME/madlib/bin/madpack install-check -p greenplum
```

So the fix is simply to install madlib into db_kg as well.
To check from inside a database whether madlib is installed:

```
[gpadmin@vm1 ~]$ psql -d db_madlib_demo
```
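Once connected, a simple probe works too (my addition; madlib.version() errors out if the schema is missing):

```sql
SELECT madlib.version();
-- or just check that the schema exists:
SELECT count(*) FROM pg_namespace WHERE nspname = 'madlib';
```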
Now for the main content…
Using MADlib is documented in detail in the MADlib Documentation. There is far too much to cover, so I'll pick out just a few cases to study.
Array Operations
Array operations, copied from Array Operations:
Operation | Description |
---|---|
array_add() | Adds two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
sum() | Aggregate, sums vector element-wisely. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_sub() | Subtracts two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_mult() | Element-wise product of two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_div() | Element-wise division of two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_dot() | Dot-product of two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_contains() | Checks whether one array contains the other. This function returns TRUE if each non-zero element in the right array equals to the element with the same index in the left array. |
array_max() | This function finds the maximum value in the array. NULLs are ignored. Return type is the same as the input type. |
array_max_index() | This function finds the maximum value and corresponding index in the array. NULLs are ignored. Return type is array in format [max, index], and its element type is the same as the input type. |
array_min() | This function finds the minimum value in the array. NULLs are ignored. Return type is the same as the input type. |
array_min_index() | This function finds the minimum value and corresponding index in the array. NULLs are ignored. Return type is array in format [min, index], and its element type is the same as the input type. |
array_sum() | This function finds the sum of the values in the array. NULLs are ignored. Return type is the same as the input type. |
array_sum_big() | This function finds the sum of the values in the array. NULLs are ignored. Return type is always FLOAT8 regardless of input. This function is meant to replace array_sum() in cases when a sum may overflow the element type. |
array_abs_sum() | This function finds the sum of abs of the values in the array. NULLs are ignored. Return type is the same as the input type. |
array_abs() | This function takes an array as the input and finds abs of each element in the array, returning the resulting array. It requires that all the values are NON-NULL. |
array_mean() | This function finds the mean of the values in the array. NULLs are ignored. |
array_stddev() | This function finds the standard deviation of the values in the array. NULLs are ignored. |
array_of_float() | This function creates an array of set size (the argument value) of FLOAT8, initializing the values to 0.0. |
array_of_bigint() | This function creates an array of set size (the argument value) of BIGINT, initializing the values to 0. |
array_fill() | This functions set every value in the array to some desired value (provided as the argument). |
array_filter() | This function takes an array as the input and keep only elements that satisfy the operator on specified scalar. It requires that the array is 1-D and all the values are NON-NULL. Return type is the same as the input type. By default, this function removes all zeros. |
array_scalar_mult() | This function takes an array as the input and executes element-wise multiplication by the scalar provided as the second argument, returning the resulting array. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_scalar_add() | This function takes an array as the input and executes element-wise addition of the scalar provided as the second argument, returning the resulting array. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_sqrt() | This function takes an array as the input and finds square root of each element in the array, returning the resulting array. It requires that all the values are NON-NULL. |
array_pow() | This function takes an array and a float8 as the input and finds power of each element in the array, returning the resulting array. It requires that all the values are NON-NULL. |
array_square() | This function takes an array as the input and finds square of each element in the array, returning the resulting array. It requires that all the values are NON-NULL. |
normalize() | This function normalizes an array as sum of squares to be 1. It requires that the array is 1-D and all the values are NON-NULL. |
array_unnest_2d_to_1d() | This function takes a 2-D array as the input and unnests it by one level. It returns a set of 1-D arrays that correspond to rows of the input array as well as an ID column with values corresponding to row positions occupied by those 1-D arrays within the 2-D array. |
Example:
Create a table and insert data:
```sql
CREATE TABLE array_tbl ( id integer NOT NULL PRIMARY KEY,
                         array1 integer[],
                         array2 integer[]
);
INSERT INTO array_tbl VALUES
    ( 1, '{1,2,3,4,5,6,7,8,9}', '{9,8,7,6,5,4,3,2,1}' ),
    ( 2, '{1,1,0,1,1,2,3,99,8}','{0,0,0,-5,4,1,1,7,6}' );
```
Use the functions:

```
db_madlib_demo=# select id,
                        madlib.array_sum(array1),
                        madlib.array_sub(array2, array1),
                        madlib.array_max(array1),
                        madlib.array_min(array1),
                        madlib.array_mean(array1),
                        madlib.normalize(array1)
                 from array_tbl group by id;
 id | array_sum |          array_sub          | array_max | array_min |    array_mean    | normalize
----+-----------+-----------------------------+-----------+-----------+------------------+-----------
  1 |        45 | {8,6,4,2,0,-2,-4,-6,-8}     |         9 |         1 |                5 | {0.0592348877759092,0.118469775551818,0.177704663327728,0.236939551103637,0.296174438879546,0.355409326655455,0.414644214431365,0.473879102207274,0.533113989983183}
  2 |       116 | {-1,-1,0,-6,3,-1,-2,-92,-2} |        99 |         0 | 12.8888888888889 | {0.0100595273380576,0.0100595273380576,0,0.0100595273380576,0.0100595273380576,0.0201190546761152,0.0301785820141728,0.995893206467704,0.0804762187044609}
(2 rows)
```
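Two more operations from the table above, tried on the same data (my own sketch; output omitted):

```sql
SELECT id,
       madlib.array_dot(array1::float8[], array2::float8[]) AS dot,
       madlib.array_contains(array1, array2)                AS contains
FROM array_tbl;
```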
Low-Rank Matrix Factorization
This module implements factorization of an incomplete matrix using a low-rank approximation.
Mathematically, this model seeks to find matrices $U$ and $V$ (also referred to as factors) that, for any given incomplete matrix $A$, minimize
$\|A - UV^T\|_2$, s.t. $\operatorname{rank}(UV^T) \le r$,
where $\|\cdot\|_2$ denotes the Frobenius norm. Let $A$ be an $m \times n$ matrix; then $U$ will be $m \times r$ and $V$ will be $n \times r$ in dimension, with $1 \le r \ll \min(m, n)$. This model is not intended to do the full decomposition, or to be used as part of an inverse procedure. It has been widely used in recommendation systems (e.g., Netflix [2]) and feature selection (e.g., image processing [3]).
Function Syntax
```sql
lmf_igd_run( rel_output,     -- TEXT. The name of the table to receive the output.
             rel_source,     -- TEXT. The source table of (row, column, value) entries.
             col_row,        -- TEXT. Name of the row-index column.
             col_column,     -- TEXT. Name of the column-index column.
             col_value,      -- TEXT. Name of the value column.
             row_dim,        -- INTEGER. Number of rows in the matrix.
             column_dim,     -- INTEGER. Number of columns in the matrix.
             max_rank,       -- INTEGER. The rank r of the factorization.
             stepsize,       -- DOUBLE PRECISION. Gradient step size.
             scale_factor,   -- DOUBLE PRECISION. Initialization scale factor.
             num_iterations, -- INTEGER. Number of passes over the data.
             tolerance )     -- DOUBLE PRECISION. Convergence tolerance.
-- Parameter list reconstructed from the call below and the PL/pgSQL signature in its output.
```
Examples
Create a table and insert data:

```sql
CREATE TABLE lmf_data (row INT, col INT, val FLOAT8, primary key (row, col));
INSERT INTO lmf_data VALUES (1, 1, 5.0), (3, 100, 1.0), (999, 10000, 2.0);
```
Execute the function:

```
db_madlib_demo=# SELECT madlib.lmf_igd_run('lmf_model', 'lmf_data', 'row', 'col', 'val', 999, 10000, 3, 0.1, 2, 10, 1e-9);
NOTICE:  Matrix lmf_data to be factorized: 999 x 10000
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CONTEXT:  SQL statement "
CREATE TABLE lmf_model (
    id SERIAL,
    matrix_u DOUBLE PRECISION[],
    matrix_v DOUBLE PRECISION[],
    rmse DOUBLE PRECISION)"
PL/pgSQL function madlib.lmf_igd_run(character varying,regclass,character varying,character varying,character varying,integer,integer,integer,double precision,double precision,integer,double precision) line 47 at EXECUTE statement
NOTICE:
Finished low-rank matrix factorization using incremental gradient
DETAIL:
 * table : lmf_data (row, col, val)
Results:
 * RMSE = 3.61661832699015e-06
Output:
 * view : SELECT * FROM lmf_model WHERE id = 1
 lmf_igd_run
-------------
           1
(1 row)
```
Check the result.
```
db_madlib_demo=# SELECT array_dims(matrix_u) AS u_dims, array_dims(matrix_v) AS v_dims
                 FROM lmf_model
                 WHERE id = 1;
    u_dims    |     v_dims
--------------+----------------
 [1:999][1:3] | [1:10000][1:3]
(1 row)
```
Query the result values:
```
db_madlib_demo=# SELECT matrix_u[2:2][1:3] AS row_2_features
                 FROM lmf_model
                 WHERE id = 1;
                     row_2_features
---------------------------------------------------------
 {{1.97037281095982,0.312463999725878,1.06016968935728}}
(1 row)
```
Predict a missing entry (row=2, col=7654):
```
db_madlib_demo=# SELECT madlib.array_dot(
                     matrix_u[2:2][1:3],
                     matrix_v[7654:7654][1:3]
                 ) AS row_2_col_7654
                 FROM lmf_model
                 WHERE id = 1;
  row_2_col_7654
------------------
 2.37682774869935
(1 row)
```
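The same dot-product pattern can sanity-check a known entry, e.g. (row=1, col=1), which was loaded as 5.0; given the reported RMSE of ~3.6e-06, the reconstruction should come out very close (my own follow-up, not from the original session):

```sql
SELECT madlib.array_dot(
    matrix_u[1:1][1:3],
    matrix_v[1:1][1:3]
) AS row_1_col_1   -- expect a value very close to 5.0
FROM lmf_model
WHERE id = 1;
```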
Singular Value Decomposition
```
SVD Function for Dense Matrices
```
Neural Network
Classification Training Function
```
mlp_classification(
```
Regression Training Function
```
mlp_regression(
```
Optimizer Parameters
```
'learning_rate_init = <value>,
```
Prediction Function
```
mlp_predict(
```
…
There are many more functions that I haven't tried one by one; I'll look them up when I need them.
All in all, this feels to me like an edge use case for a database. GreenPlum's core selling point is analytics, but MADlib is an Apache project, so can it really count as Greenplum's own selling point? And analyzing data inside a database still feels a little off to me. On top of that, Greenplum itself isn't very stable and has quite a few bugs; the bulk load earlier failed (gpfdist).
That said, the tool is convenient: the data gets processed right where it is stored.
In short, it's enough to know that it exists.