Generating data with hive-testbench as a non-root user
Step 1: download the source
hive-testbench source: https://github.com/hortonworks/hive-testbench
Clone it into a directory of your own. The command, ready to paste:
git clone https://github.com/hortonworks/hive-testbench
Also download the zip archive using the Download ZIP button at the top right of the repository page: hive-testbench-hdp3.zip
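The usual installation
a) On your own Linux machine with internet access, generating data from the zip works without any trouble.
There are many tutorials online, for example: https://blog.****.net/huxuanlai/article/details/59484212
Since there are plenty of them, I won't repeat the steps here.
b) This post is about a company environment that cannot reach the internet and has many other restrictions, so a different route is needed.
The steps of the zip-based installation give some clues:
first run ./tpcds-build.sh
then run ./tpcds-setup.sh 100 to generate the data
So start with tpcds-build.sh and look at its source.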
Working through the source
- tpcds-build.sh: the source is reproduced below; GitHub has a properly formatted copy.
- The important part is the second-to-last line.
#!/bin/sh
# Check for all the stuff I need to function.
for f in gcc javac; do
    which $f > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        echo "Required program $f is missing. Please install or fix your path and try again."
        exit 1
    fi
done
# Check if Maven is installed and install it if not.
which mvn > /dev/null 2>&1
if [ $? -ne 0 ]; then
    SKIP=0
    if [ -e "apache-maven-3.0.5-bin.tar.gz" ]; then
        SIZE=`du -b apache-maven-3.0.5-bin.tar.gz | cut -f 1`
        if [ $SIZE -eq 5144659 ]; then
            SKIP=1
        fi
    fi
    if [ $SKIP -ne 1 ]; then
        echo "Maven not found, automatically installing it."
        curl -O http://www.us.apache.org/dist/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz 2> /dev/null
        if [ $? -ne 0 ]; then
            echo "Failed to download Maven, check Internet connectivity and try again."
            exit 1
        fi
    fi
    tar -zxf apache-maven-3.0.5-bin.tar.gz > /dev/null
    CWD=$(pwd)
    export MAVEN_HOME="$CWD/apache-maven-3.0.5"
    export PATH=$PATH:$MAVEN_HOME/bin
fi
echo "Building TPC-DS Data Generator"
(cd tpcds-gen; make) # this is the step that really matters
echo "TPC-DS Data Generator built, you can now use tpcds-setup.sh to generate data."
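The prerequisite loop at the top of the script can also be run on its own before attempting a build on a locked-down machine. The sketch below is a minimal version of that idea. The function name `missing_tools` is made up for illustration, and checking `make` and `mvn` in addition to `gcc` and `javac` is an assumption based on what the script goes on to call.

```shell
#!/bin/sh
# Hypothetical helper (not part of hive-testbench): print every tool from the
# argument list that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        # command -v is the POSIX way to ask whether a command is available
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Report which build prerequisites are absent on this machine.
# gcc and javac come from tpcds-build.sh; make and mvn are added as an assumption.
missing=$(missing_tools gcc javac make mvn)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All build tools found."
fi
```

Running this first shows up front whether the Maven download branch of the script would be triggered.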
Reading the script shows that the real work is to go into the tpcds-gen folder and compile there.
So cd into the tpcds-gen folder and list its contents:
cd tpcds-gen
ll
You should see five files there.
Since the build script runs make in this folder, try running make in tpcds-gen directly:
make
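It fails.
I no longer remember the exact error, but it was closely tied to the following link, which cannot be reached from this environment either: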
http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpcds
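Because the exact error message was not kept, a quick probe can confirm whether the download host is the problem. This is a minimal sketch and not part of hive-testbench: `can_reach` is a hypothetical helper, and the 5-second timeout is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if an HTTP request to the URL completes
# with a non-error status within 5 seconds.
can_reach() {
    # -s silent, -f fail on HTTP errors, --max-time bounds the whole request,
    # -o /dev/null discards the response body
    curl -sf --max-time 5 -o /dev/null "$1"
}

# The README URL is the one the Makefile tries to fetch.
if can_reach "http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpcds/README"; then
    echo "download host reachable"
else
    echo "download host NOT reachable from this machine"
fi
```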
To see where that error comes from, look at the Makefile source:
all: target/lib/dsdgen.jar target/tpcds-gen-1.0-SNAPSHOT.jar

target/tpcds-gen-1.0-SNAPSHOT.jar: $(shell find -name *.java)
	mvn package

target/tpcds_kit.zip: tpcds_kit.zip
	mkdir -p target/
	cp tpcds_kit.zip target/tpcds_kit.zip

tpcds_kit.zip:
	curl http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpcds/README
	curl --output tpcds_kit.zip http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpcds/TPCDS_Tools.zip

target/lib/dsdgen.jar: target/tools/dsdgen
	cd target/; mkdir -p lib/; ( jar cvf lib/dsdgen.jar tools/ || gjar cvf lib/dsdgen.jar tools/ )

target/tools/dsdgen: target/tpcds_kit.zip
	test -d target/tools/ || (cd target; unzip tpcds_kit.zip)
	test -d target/tools/ || (cd target; mv */tools tools)
	cd target/tools; cat ../../patches/all/*.patch | patch -p0
	cd target/tools; cat ../../patches/${MYOS}/*.patch | patch -p1
	cd target/tools; make clean; make dsdgen

clean:
	mvn clean
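For a beginner this is hard going, but the error is easy to locate: the tpcds_kit.zip: rule contains the URL from the error. Its first line only fetches a README, which doesn't matter. Its second line downloads TPCDS_Tools.zip and saves it as tpcds_kit.zip. Since TPCDS_Tools.zip cannot be downloaded from this machine, upload it by hand instead.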
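One way to locate such rules without reading the whole Makefile is to search it for download commands. The sketch below assumes the downloads are done with curl or wget; `network_lines` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print, with line numbers, every line of a file that
# mentions curl or wget, i.e. every step likely to need network access.
network_lines() {
    grep -nE 'curl|wget' "$1"
}

# Example, meant to be run from inside tpcds-gen:
if [ -f Makefile ]; then
    network_lines Makefile || echo "no download commands found"
fi
```

For the Makefile shown above this should point at the two curl lines of the tpcds_kit.zip: rule.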
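Download TPCDS_Tools.zip on a machine that has internet access: http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpcds/TPCDS_Tools.zip
Then rename it to tpcds_kit.zip and place it in the tpcds-gen folder, next to the Makefile.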
The rule in the Makefile currently reads:
tpcds_kit.zip:
	curl http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpcds/README
	curl --output tpcds_kit.zip http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpcds/TPCDS_Tools.zip
Since the file has been uploaded by hand, these two lines are no longer needed, so comment them out:
tpcds_kit.zip:
	# curl http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpcds/README
	# curl --output tpcds_kit.zip http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpcds/TPCDS_Tools.zip
	echo "200"
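The manual upload can be sketched as a small staging step. `stage_kit` is a hypothetical helper, and the paths in the usage comment are placeholders. As far as I understand make, a file target with no prerequisites (such as the tpcds_kit.zip: rule here) is treated as up to date once the file exists, so with the zip in place the download lines should not run even if left uncommented. Commenting them out simply makes that explicit.

```shell
#!/bin/sh
# Hypothetical helper: copy a manually transferred TPCDS_Tools.zip into the
# tpcds-gen directory under the name the Makefile expects (tpcds_kit.zip).
stage_kit() {
    src=$1      # path to the uploaded TPCDS_Tools.zip
    gen_dir=$2  # path to hive-testbench/tpcds-gen
    if [ ! -f "$src" ]; then
        echo "kit archive not found: $src" >&2
        return 1
    fi
    if [ ! -d "$gen_dir" ]; then
        echo "tpcds-gen directory not found: $gen_dir" >&2
        return 1
    fi
    cp "$src" "$gen_dir/tpcds_kit.zip"
}

# Usage (placeholder paths):
#   stage_kit ~/upload/TPCDS_Tools.zip ~/hive-testbench/tpcds-gen
```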
Run make again. That problem is gone, but another one appears:
...
adding: tools/s_manufacturer.h(in = 2069) (out= 1070)(deflated 48%)
adding: tools/genrand.o(in = 59424) (out= 11949)(deflated 79%)
adding: tools/s_division.h(in = 2174) (out= 1100)(deflated 49%)
mvn package
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.notmysock.tpcds:tpcds-gen:jar:1.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 47, column 15
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-jar-plugin is missing. @ line 54, column 15
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO]
[INFO] -------------------< org.notmysock.tpcds:tpcds-gen >--------------------
[INFO] Building tpcds-gen 1.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/maven/plugins/maven-resources-plugin/2.6/maven-resources-plugin-2.6.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:08 min
[INFO] Finished at: 2019-04-26T15:00:58+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Plugin org.apache.maven.plugins:maven-resources-plugin:2.6 or one of its dependencies could not be resolved: Failed to read artifact descriptor for org.apache.maven.plugins:maven-resources-plugin:jar:2.6: Could not transfer artifact org.apache.maven.plugins:maven-resources-plugin:pom:2.6 from/to central (https://repo.maven.apache.org/maven2): Connect to repo.maven.apache.org:443 [repo.maven.apache.org/151.101.40.215] failed: Connection timed out -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException
make: *** [target/tpcds-gen-1.0-SNAPSHOT.jar] Error 1
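What is this error? The [ERROR] lines show Maven timing out while connecting to repo.maven.apache.org, which is the same no-internet problem again. The last line is the telling part: the target that failed is our familiar .jar file.
In other words, the .jar file was not built.
Had the build succeeded, the next step would be tpcds-setup.sh, so look at that script's source first: at line 61 it goes into tpcds-gen/target to find the .jar file.
So the .jar file is what really matters. Since building on the server runs into so many errors, build it locally instead.
That means producing tpcds-gen-1.0-SNAPSHOT.jar on a local machine:
open hive-testbench in IDEA, let Maven import the dependencies, then run compile and package.
Find the generated jar and upload it to the tpcds-gen/target folder on the Linux server.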
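Before rerunning the setup script it may be worth confirming that the uploaded jar landed where it is expected. The jar name comes from the Makefile; `jar_ready` is a hypothetical helper, and the sketch only checks that the file exists and is non-empty, not that it is a valid build.

```shell
#!/bin/sh
# Hypothetical helper: check that the generator jar is present in
# <tpcds-gen dir>/target and is not empty.
jar_ready() {
    jar_path="$1/target/tpcds-gen-1.0-SNAPSHOT.jar"
    if [ -s "$jar_path" ]; then
        echo "found $jar_path"
    else
        echo "missing or empty: $jar_path" >&2
        return 1
    fi
}

# Example, meant to be run from the hive-testbench root:
jar_ready tpcds-gen || echo "upload the jar before running tpcds-setup.sh"
```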
Run ./tpcds-setup.sh 10 again.
Perfect.
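Note: the final command takes the form ./tpcds-setup.sh 10 /xxx.db/tpds_data
The first argument (10) is the scale factor. The last argument is the HDFS directory, where xxx.db is your own database.
You can then inspect the generated data and its size under that directory.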
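Checking the output could look like the following. It assumes the `hdfs` client is on PATH for the current user; `show_generated` is a hypothetical helper, and the path in the usage comment is the placeholder from the note above.

```shell
#!/bin/sh
# Hypothetical helper: list an HDFS directory and print its total size.
show_generated() {
    if ! command -v hdfs > /dev/null 2>&1; then
        echo "hdfs client not found on PATH" >&2
        return 1
    fi
    hdfs dfs -ls "$1"          # entries directly under the directory
    hdfs dfs -du -s -h "$1"    # summarized, human-readable total size
}

# Usage (placeholder path from the note above):
#   show_generated /xxx.db/tpds_data
```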
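The last step is loading this data into Hive tables. The matching CREATE TABLE statements can also be found in the source:
https://github.com/hortonworks/hive-testbench/blob/hdp3/ddl-tpcds/text/alltables.sql
Run the corresponding SQL to load the data.
The query SQL is there as well, 99 queries in total: https://github.com/hortonworks/hive-testbench/tree/hdp3/sample-queries-tpcds
Loading the data into Hive: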
hive -d DB='your database name' -d LOCATION='/XXXX.db/XXX_test/3/' -f promotion.sql
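If there is one DDL file per table, as the promotion.sql example suggests, the same command can be repeated for each file. The sketch below only prints the commands (a dry run) so they can be reviewed first. `print_load_cmds`, the directory layout, and the argument order are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print one hive command per .sql file in a directory,
# using the same -d DB / -d LOCATION / -f form as the command above.
print_load_cmds() {
    db=$1        # target Hive database
    location=$2  # HDFS directory holding the generated data
    ddl_dir=$3   # local directory containing the per-table .sql files
    for sql in "$ddl_dir"/*.sql; do
        [ -e "$sql" ] || continue   # skip when the glob matches nothing
        echo "hive -d DB='$db' -d LOCATION='$location' -f $sql"
    done
}

# Usage (placeholder values):
#   print_load_cmds your_db /XXXX.db/XXX_test/3/ ./ddl
```

Printing rather than executing keeps the step reversible: nothing touches Hive until the listed commands have been checked and run by hand.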