Deploying DataHub for Metadata Management with Docker

Date: 2023-01-07 20:55:03

1. Server requirements

DataHub needs at least 4 GB of memory allocated to it.
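
A quick way to check the memory requirement before going further is to read MemTotal from /proc/meminfo. The following is a minimal sketch, not part of DataHub: has_min_memory_kb is a hypothetical helper name, and the check looks at total RAM on the host rather than at what Docker actually hands to the containers.

```shell
# Sketch: compare a memory size in kB against a minimum given in GiB.
# has_min_memory_kb is a hypothetical helper, not a DataHub or Docker command.
has_min_memory_kb() {
    local total_kb=$1 required_gb=$2
    # 1 GiB = 1024 * 1024 kB
    [ "$total_kb" -ge $(( required_gb * 1024 * 1024 )) ]
}

# On Linux, MemTotal in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    if has_min_memory_kb "$total_kb" 4; then
        echo "OK: ${total_kb} kB of RAM"
    else
        echo "WARNING: only ${total_kb} kB of RAM, DataHub wants at least 4 GB" >&2
    fi
fi
```

A machine sized at exactly 4 GB usually reports a little less than 4194304 kB because the kernel reserves some memory, so treat the warning as a hint rather than a hard failure.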

2. Install Docker

To install Docker, see my blog post on installing and uninstalling Docker on CentOS 7 via the yum repository.

The versions installed here were the latest at the time: docker-ce-20.10.17, docker-ce-cli-20.10.17, and containerd.io-1.6.6.

3. Install jq

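About jq: jq is a command-line JSON processor for slicing, filtering, mapping, and transforming structured data. It does for JSON roughly what sed does for plain text on Linux.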

I installed version 1.6, the latest at the time.

[root@datahub ~]# wget https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64
[root@datahub ~]#
[root@datahub ~]# mv jq-linux64 jq
[root@datahub ~]# 
[root@datahub ~]# chmod +x jq
[root@datahub ~]# 
[root@datahub ~]# cp jq /usr/bin
[root@datahub ~]# 
[root@datahub ~]# jq
jq - commandline JSON processor [version 1.6]

Usage:	jq [options] <jq filter> [file...]
	jq [options] --args <jq filter> [strings...]
	jq [options] --jsonargs <jq filter> [JSON_TEXTS...]

jq is a tool for processing JSON inputs, applying the given filter to
its JSON text inputs and producing the filter's results as JSON on
standard output.

The simplest filter is ., which copies jq's input to its output
unmodified (except for formatting, but note that IEEE754 is used
for number representation internally, with all that that implies).

For more advanced filters see the jq(1) manpage ("man jq")
and/or https://stedolan.github.io/jq

Example:

	$ echo '{"foo": 0}' | jq .
	{
		"foo": 0
	}

For a listing of options, use jq --help.
[root@datahub ~]# 

4. Install Python 3

Python 3.6 or later is required. To install Python, see my post on installing Python 2 and Python 3 side by side on CentOS 7.

I installed Python 3.10.5, the latest version at the time.
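
The "3.6 or later" requirement is easy to get wrong with a plain string comparison, because "3.10.5" sorts before "3.6" alphabetically. The sketch below uses the version sort of GNU `sort -V` instead. version_ge is a hypothetical helper name, and the script assumes GNU coreutils, which CentOS 7 ships.

```shell
# version_ge is a hypothetical helper: succeeds when version $1 >= version $2.
# It relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    # The smaller of the two versions sorts first; $1 >= $2 exactly when that is $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Ask the interpreter for its own version string, e.g. "3.10.5".
if command -v python3 >/dev/null 2>&1; then
    py_version=$(python3 -c 'import platform; print(platform.python_version())')
    if version_ge "$py_version" 3.6; then
        echo "python3 $py_version is new enough"
    else
        echo "python3 $py_version is older than 3.6" >&2
    fi
fi
```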

5. Install docker-compose v1 (deprecated, kept for compatibility)

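This step needs Python's pip. The docker-compose v1 installed here is v1.29.2, the latest v1 release at the time. To avoid conflicts with the packages of the existing Python environment, it is installed inside a virtualenv.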

5.1 Install virtualenv

[root@datahub ~]# pip3 install --upgrade pip
[root@datahub ~]#
[root@datahub ~]# pip3 install virtualenv
[root@datahub ~]#
[root@datahub ~]# /root/python-3.10.5/bin/virtualenv docker-compose-v1-py --python=/root/python-3.10.5/bin/python3
[root@datahub ~]#

This creates a Python 3.10.5 virtual environment in the directory docker-compose-v1-py under the current directory.

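Activate the docker-compose-v1-py virtual environment with the first command below; deactivate leaves it again. The remaining steps assume the environment is activated.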

[root@datahub ~]# . docker-compose-v1-py/bin/activate
(docker-compose-v1-py) [root@datahub ~]# 
(docker-compose-v1-py) [root@datahub ~]# deactivate
[root@datahub ~]#
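
Since the later steps must run inside the virtual environment, a script can guard against running pip in the wrong place. The activate script exports the VIRTUAL_ENV variable and deactivate unsets it, so testing that variable is enough. This is a minimal sketch; in_venv is a hypothetical helper name.

```shell
# in_venv is a hypothetical helper. The activate script exports VIRTUAL_ENV
# and deactivate unsets it, so the variable tells the two states apart.
in_venv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if in_venv; then
    echo "virtualenv active: $VIRTUAL_ENV"
else
    echo "no virtualenv active; run: . docker-compose-v1-py/bin/activate" >&2
fi
```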

5.2 Install docker-compose

(docker-compose-v1-py) [root@datahub ~]# pip3 install --upgrade pip
(docker-compose-v1-py) [root@datahub ~]#
(docker-compose-v1-py) [root@datahub ~]# pip3 install docker-compose
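
At this point all prerequisites from sections 2 to 5 should be on the PATH. One way to confirm that before installing DataHub is sketched below. missing_cmds is a hypothetical helper name; it only checks that each command can be found, not that its version is suitable.

```shell
# missing_cmds is a hypothetical helper: prints each named command
# that cannot be found on PATH, one per line.
missing_cmds() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Run inside the activated virtualenv so that docker-compose is on PATH.
missing=$(missing_cmds docker docker-compose jq python3)
if [ -z "$missing" ]; then
    echo "all prerequisites found"
else
    echo "missing: $missing" >&2
fi
```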

6. Install DataHub (inside the docker-compose-v1-py virtual environment)

First install the DataHub CLI package:

(docker-compose-v1-py) [root@datahub ~]# pip3 install --upgrade pip wheel setuptools
(docker-compose-v1-py) [root@datahub ~]# pip3 uninstall datahub acryl-datahub || true
(docker-compose-v1-py) [root@datahub ~]# pip3 install --upgrade acryl-datahub
...... (output omitted) ......
Installing collected packages: types-termcolor, types-Deprecated, termcolor, tabulate, ratelimiter, pytz, mypy-extensions, wrapt, tzdata, typing-extensions, toml, stackprinter, python-utils, python-dateutil, pyparsing, psutil, markupsafe, humanfriendly, expandvars, entrypoints, click, avro, typing-inspect, pytz-deprecation-shim, pydantic, progressbar2, packaging, mixpanel, Deprecated, click-default-group, tzlocal, avro-gen3, acryl-datahub
Successfully installed Deprecated-1.2.13 acryl-datahub-0.8.38 avro-1.10.2 avro-gen3-0.7.4 click-8.1.3 click-default-group-1.2.2 entrypoints-0.4 expandvars-0.9.0 humanfriendly-10.0 markupsafe-2.0.1 mixpanel-4.9.0 mypy-extensions-0.4.3 packaging-21.3 progressbar2-4.0.0 psutil-5.9.1 pydantic-1.9.1 pyparsing-3.0.9 python-dateutil-2.8.2 python-utils-3.3.3 pytz-2022.1 pytz-deprecation-shim-0.1.0.post0 ratelimiter-1.2.0.post0 stackprinter-0.2.6 tabulate-0.8.9 termcolor-1.1.0 toml-0.10.2 types-Deprecated-1.2.8 types-termcolor-1.1.4 typing-extensions-4.2.0 typing-inspect-0.7.1 tzdata-2022.1 tzlocal-4.2 wrapt-1.14.1
(docker-compose-v1-py) [root@datahub ~]#
(docker-compose-v1-py) [root@datahub ~]# python3 -m datahub version
DataHub CLI version: 0.8.38
Python version: 3.10.5 (main, Jun 18 2022, 17:36:43) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
(docker-compose-v1-py) [root@datahub ~]# 

This installs DataHub 0.8.38, the latest version at the time.

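Then start DataHub. After a server reboot all the containers are stopped; rerun the python3 -m datahub docker quickstart command to start them again. Metadata that was ingested earlier is not lost.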

(docker-compose-v1-py) [root@datahub ~]# wget https://raw.githubusercontent.com/datahub-project/datahub/v0.8.38/docker/quickstart/docker-compose.quickstart.yml
(docker-compose-v1-py) [root@datahub ~]#
(docker-compose-v1-py) [root@datahub ~]# python3 -m datahub docker quickstart --quickstart-compose-file /root/docker-compose.quickstart.yml
Pulling elasticsearch          ... done
Pulling elasticsearch-setup    ... done
Pulling mysql                  ... done
Pulling datahub-gms            ... done
Pulling datahub-frontend-react ... done
Pulling datahub-actions        ... done
Pulling mysql-setup            ... done
Pulling neo4j                  ... done
Pulling zookeeper              ... done
Pulling broker                 ... done
Pulling schema-registry        ... done
Pulling kafka-setup            ... done

Creating network "datahub_network" with the default driver
Creating volume "datahub_broker" with default driver
Creating volume "datahub_esdata" with default driver
Creating volume "datahub_mysqldata" with default driver
Creating volume "datahub_neo4jdata" with default driver
Creating volume "datahub_zkdata" with default driver
Creating neo4j         ... done
Creating mysql         ... done
Creating elasticsearch ... done
Creating zookeeper           ... done
Creating mysql-setup         ... done
Creating datahub-gms         ... done
Creating elasticsearch-setup       ... done
Creating broker                    ... done
Creating datahub-frontend-react    ... done
Creating datahub_datahub-actions_1 ... done
Creating schema-registry           ... done
Creating kafka-setup               ... done
...... (output omitted) ......
✔ DataHub is now running
Ingest some demo data using `datahub docker ingest-sample-data`,
or head to http://localhost:9002 (username: datahub, password: datahub) to play around with the frontend.
Need support? Get in touch on Slack: https://slack.datahubproject.io/
(docker-compose-v1-py) [root@datahub ~]#
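
When the quickstart is run from a script, for example after a reboot, it can help to wait until the frontend port accepts connections before continuing. The sketch below polls a TCP port using bash's /dev/tcp pseudo-device. wait_for_port is a hypothetical helper name, it needs bash rather than a plain POSIX sh, and an open port only shows that something is listening, not that DataHub has finished starting.

```shell
# wait_for_port is a hypothetical helper: wait_for_port HOST PORT ATTEMPTS DELAY
# It relies on bash's /dev/tcp pseudo-device, so run it with bash.
wait_for_port() {
    local host=$1 port=$2 attempts=$3 delay=$4 i=0
    while [ "$i" -lt "$attempts" ]; do
        # The subshell opens a TCP connection; it is closed when the subshell exits.
        if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
            return 0
        fi
        sleep "$delay"
        i=$(( i + 1 ))
    done
    return 1
}

# Example (9002 is the frontend port printed by the quickstart output):
#   wait_for_port localhost 9002 30 2 && echo "frontend is accepting connections"
```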

7. Open the web UI and ingest the sample metadata

Then open http://datahub:9002 (port 9002, as printed by the quickstart output) and log in with username datahub and password datahub.


Next, ingest some of the official sample metadata:

(docker-compose-v1-py) [root@datahub ~]# python3 -m datahub docker ingest-sample-data
Downloading sample data...
Downloaded to /tmp/tmpn3k2a_di.json
Starting ingestion...
[2022-06-19 10:24:34,830] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit file:///tmp/tmpn3k2a_di.json:0
[2022-06-19 10:24:35,075] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit file:///tmp/tmpn3k2a_di.json:1
[2022-06-19 10:24:35,395] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit file:///tmp/tmpn3k2a_di.json:2
...... (output omitted) ......
[2022-06-19 10:24:58,419] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit file:///tmp/tmpn3k2a_di.json:94
[2022-06-19 10:24:58,578] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit file:///tmp/tmpn3k2a_di.json:95
[2022-06-19 10:24:58,727] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit file:///tmp/tmpn3k2a_di.json:96

Source (file) report:
{'workunits_produced': 97,
 'workunit_ids': ['file:///tmp/tmpn3k2a_di.json:0',
                  'file:///tmp/tmpn3k2a_di.json:1',
                  'file:///tmp/tmpn3k2a_di.json:2',
...... (output omitted) ......
                  'file:///tmp/tmpn3k2a_di.json:94',
                  'file:///tmp/tmpn3k2a_di.json:95',
                  'file:///tmp/tmpn3k2a_di.json:96'],
 'warnings': {},
 'failures': {},
 'cli_version': '0.8.38',
 'cli_entry_location': '/root/docker-compose-v1-py/lib/python3.10/site-packages/datahub/__init__.py',
 'py_version': '3.10.5 (main, Jun 18 2022, 17:36:43) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]',
 'py_exec_path': '/root/docker-compose-v1-py/bin/python3',
 'os_details': 'Linux-3.10.0-1160.66.1.el7.x86_64-x86_64-with-glibc2.17'}
Sink (datahub-rest) report:
{'records_written': 97,
 'warnings': [],
 'failures': [],
 'downstream_start_time': datetime.datetime(2022, 6, 19, 10, 24, 34, 550628),
 'downstream_end_time': datetime.datetime(2022, 6, 19, 10, 24, 58, 727050),
 'downstream_total_latency_in_seconds': 24.176422,
 'gms_version': 'v0.8.38'}

Pipeline finished successfully producing 97 workunits
(docker-compose-v1-py) [root@datahub ~]#
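
In an automated setup the final summary line can be used to confirm that the ingestion worked. The sketch below extracts the workunit count from a line worded like the one in the output above. workunit_count is a hypothetical helper name, and the exact wording is taken from this 0.8.38 run, so other CLI versions may print something different.

```shell
# workunit_count is a hypothetical helper. It reads CLI output on stdin and
# prints N from a line of the form
#   "Pipeline finished successfully producing N workunits"
# It returns non-zero when no such line is present.
workunit_count() {
    local count
    count=$(sed -n 's/^Pipeline finished successfully producing \([0-9][0-9]*\) workunits.*$/\1/p' | tail -n 1)
    [ -n "$count" ] && echo "$count"
}

# Example (2>&1 because the summary may be written to either stream):
#   python3 -m datahub docker ingest-sample-data 2>&1 | workunit_count
```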

8. Remove all DataHub containers, volumes, and networks

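If you have not yet ingested your own metadata, the following command removes all of DataHub's containers, volumes, and networks, including the official sample metadata that was just ingested: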

(docker-compose-v1-py) [root@datahub ~]# python3 -m datahub docker nuke
Removing containers in the datahub project
Removing volumes in the datahub project
Removing networks in the datahub project
(docker-compose-v1-py) [root@datahub ~]#
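
Because nuke also deletes the volumes that hold the metadata, it may be worth putting a confirmation step in front of it on a shared machine. The following is only a sketch of such a guard: run_if_confirmed is a hypothetical wrapper, not a DataHub feature, and it runs the given command only when the word "nuke" is typed at the prompt.

```shell
# run_if_confirmed is a hypothetical wrapper: it runs the given command
# only when the word "nuke" is typed at the prompt.
run_if_confirmed() {
    local answer
    printf 'Type "nuke" to delete all DataHub containers, volumes, and networks: ' >&2
    # Treat end-of-input as an empty answer.
    read -r answer || answer=""
    if [ "$answer" = "nuke" ]; then
        "$@"
    else
        echo "Aborted, nothing was removed." >&2
        return 1
    fi
}

# Example:
#   run_if_confirmed python3 -m datahub docker nuke
```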