Understanding the TPCx-BB (Big Data Benchmark) Source Code in One Article

2023-07-21

TPCx-BB is a big data benchmark. It simulates 30 application scenarios of a retailer and executes 30 queries to measure the performance, hardware and software included, of a Hadoop-based big data system. Some of the scenarios also use machine learning algorithms (clustering, linear regression, and so on). To properly understand the performance of the system under test, you need a thorough understanding of the whole TPCx-BB test flow. This article walks through the source code of the TPCx-BB test suite in detail and should help you understand how TPCx-BB works.

Code Structure

The main directory ($BENCH_MARK_HOME) contains the following subdirectories:

bin
conf
data-generator
engines
tools


bin contains several modules, the scripts used during execution: bigBench, cleanLogs, logEnvInformation, runBenchmark, zipLogs, and so on.

conf contains two configuration files: bigBench.properties and userSettings.conf.

bigBench.properties mainly sets the workload (the benchmark phases to execute) and power_test_0 (the SQL queries to execute during the POWER_TEST phase).

The default workload is:

workload=CLEAN_ALL,ENGINE_VALIDATION_DATA_GENERATION,ENGINE_VALIDATION_LOAD_TEST,ENGINE_VALIDATION_POWER_TEST,ENGINE_VALIDATION_RESULT_VALIDATION,CLEAN_DATA,DATA_GENERATION,BENCHMARK_START,LOAD_TEST,POWER_TEST,THROUGHPUT_TEST_1,BENCHMARK_STOP,VALIDATE_POWER_TEST,VALIDATE_THROUGHPUT_TEST_1

The default power_test_0 is 1-30, i.e. all 30 queries.
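As a concrete illustration, the sketch below rewrites bigBench.properties with a reduced workload for a quick smoke test. This is a hypothetical example: the phase names are taken from the default workload above, and power_test_0 is assumed to accept the same query-list syntax as the -j option; back up the original file before trying it.

```bash
# Hypothetical sketch: shrink the workload to a quick smoke test.
# Phase names are copied from the default workload shown above.
cat > "$BIG_BENCH_HOME/conf/bigBench.properties" <<'EOF'
workload=CLEAN_ALL,DATA_GENERATION,BENCHMARK_START,LOAD_TEST,POWER_TEST,BENCHMARK_STOP
power_test_0=1,7,30
EOF
```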

userSettings.conf holds the basic settings, including the Java environment, the benchmark defaults (database, engine, map_tasks, scale_factor, ...), the Hadoop environment, HDFS config and paths, and the Hadoop data generation options (DFS_REPLICATION, HADOOP_JVM_ENV, ...).

data-generator contains the scripts and configuration files related to data generation; the details are covered in the next section.

engines contains the four engines supported by TPCx-BB: biginsights, hive, impala and spark_sql. The default engine is hive. In fact, only the hive directory is non-empty; the other three are empty, presumably because they have not been completed yet.

tools contains two jar files: HadoopClusterExec.jar and RunBigBench.jar. RunBigBench.jar is a central piece of the TPCx-BB test; most of the program logic lives in this jar.

Data Generation

The programs and configuration for data generation live in the data-generator directory, which contains pdgf.jar and three subdirectories: config, dicts and extlib.

pdgf.jar is the Java program that generates the data; it is a large codebase. config contains two configuration files: bigbench-generation.xml and bigbench-schema.xml.

bigbench-generation.xml mainly configures the raw data to be generated (not the database tables): which tables are produced, each table's name and size, the output directory, the file suffix, the delimiter, the character encoding, and so on.

<schema name="default">
<tables>
<!-- not refreshed tables --> <!-- tables not used in benchmark, but some tables have references to them. not refreshed. Kept for legacy reasons -->
<table name="income_band"></table>
<table name="reason"></table>
<table name="ship_mode"></table>
<table name="web_site"></table>
<!-- /tables not used in benchmark --> <!-- Static tables (fixed small size, generated only on node 1, skipped on others, not generated during refresh) -->
<table name="date_dim" static="true"></table>
<table name="time_dim" static="true"></table>
<table name="customer_demographics" static="true"></table>
<table name="household_demographics" static="true"></table>
<!-- /static tables --> <!-- "normal" tables. split over all nodes. not generated during refresh -->
<table name="store"></table>
<table name="warehouse"></table>
<table name="promotion"></table>
<table name="web_page"></table>
<!-- /"normal" tables.--> <!-- /not refreshed tables --> <!--
refreshed tables. Generated on all nodes.
Refresh tables generate extra data during refresh (e.g. add new data to the existing tables)
In "normal"-Phase generate table rows: [0,REFRESH_PERCENTAGE*Table.Size];
In "refresh"-Phase generate table rows: [REFRESH_PERCENTAGE*Table.Size+1, Table.Size]
.Has effect only if ${REFRESH_SYSTEM_ENABLED}==1.
-->
<table name="customer">
<scheduler name="DefaultScheduler">
<partitioner
name="pdgf.core.dataGenerator.scheduler.TemplatePartitioner">
<prePartition><![CDATA[
if(${REFRESH_SYSTEM_ENABLED}>0){
int tableID = table.getTableID();
int timeID = 0;
long lastTableRow=table.getSize()-1;
long rowStart;
long rowStop;
boolean exclude=false;
long refreshRows=table.getSize()*(1.0-${REFRESH_PERCENTAGE});
if(${REFRESH_PHASE}>0){
//Refresh part
rowStart = lastTableRow - refreshRows +1;
rowStop = lastTableRow;
if(refreshRows<=0){
exclude=true;
}
}else{
//"normal" part
rowStart = 0;
rowStop = lastTableRow - refreshRows;
}
return new pdgf.core.dataGenerator.scheduler.Partition(tableID, timeID,rowStart,rowStop,exclude);
}else{
//DEFAULT
return getParentPartitioner().getDefaultPrePartition(project, table);
} ]]></prePartition>
</partitioner>
</scheduler>
</table>
<output name="SplitFileOutputWrapper">
<!-- DEFAULT output for all Tables, if no table specific output is specified-->
<output name="CSVRowOutput">
<fileTemplate><![CDATA[outputDir + table.getName() +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
<outputDir>output/</outputDir>
<fileEnding>.dat</fileEnding>
<delimiter>|</delimiter>
<charset>UTF-8</charset>
<sortByRowID>true</sortByRowID>
</output>
<output name="StatisticsOutput" active="1">
<size>${item_size}</size><!-- a counter per item .. initialize later-->
<fileTemplate><![CDATA[outputDir + table.getName()+"_audit" +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
<outputDir>output/</outputDir>
<fileEnding>.csv</fileEnding>
<delimiter>,</delimiter>
<header><!--"" + pdgf.util.Constants.DEFAULT_LINESEPARATOR-->
</header>
<footer></footer>
</output>
</output>
</tables>
</schema>

bigbench-schema.xml defines a large number of parameters. A few concern table scale, such as each table's size (number of rows); most concern table columns, such as start and end dates, the gender ratio, the married ratio, and upper and lower bounds for various metrics. It also defines exactly how each column is generated and which constraints apply. Excerpts are shown below.

The size of the generated data is determined by SCALE_FACTOR (-f). With -f 1 the total generated data is roughly 1 GB; with -f 100, roughly 100 GB. So how does SCALE_FACTOR (-f) control the generated data size so precisely?

The reason is that SCALE_FACTOR (-f) determines the row count of every table. As shown below, the customer table has 100000.0d * ${SF_sqrt} rows: with -f 1 that is 100000 * sqrt(1) = 100,000 rows; with -f 100 it is 100000 * sqrt(100) = 1,000,000 rows.

<property name="${customer_size}" type="long">100000.0d * ${SF_sqrt}</property>
<property name="${DIMENSION_TABLES_START_DAY}" type="datetime">2000-01-03 00:00:00</property>
<property name="${DIMENSION_TABLES_END_DAY}" type="datetime">2004-01-05 00:00:00</property>
<property name="${gender_likelihood}" type="double">0.5</property>
<property name="${married_likelihood}" type="double">0.3</property>
<property name="${WP_LINK_MIN}" type="double">2</property>
<property name="${WP_LINK_MAX}" type="double">25</property>
  <field name="d_date" size="13" type="CHAR" primary="false">
<gen_DateTime>
<disableRng>true</disableRng>
<useFixedStepSize>true</useFixedStepSize>
<startDate>${date_dim_begin_date}</startDate>
<endDate>${date_dim_end_date}</endDate>
<outputFormat>yyyy-MM-dd</outputFormat>
</gen_DateTime>
</field>
  <field name="t_time_id" size="16" type="CHAR" primary="false">
<gen_ConvertNumberToString>
<gen_Id/>
<size>16.0</size>
<characters>ABCDEFGHIJKLMNOPQRSTUVWXYZ</characters>
</gen_ConvertNumberToString>
</field>
<field name="cd_dep_employed_count" size="10" type="INTEGER" primary="false">
<gen_Null probability="${NULL_CHANCE}">
<gen_WeightedListItem filename="dicts/bigbench/ds-genProbabilities.txt" list="dependent_count" valueColumn="0" weightColumn="0" />
</gen_Null>
</field>

dicts contains dictionary files such as city.dict, country.dict, male.dict, female.dict, state.dict and mail_provider.dict; the column values of each generated record are drawn from these dictionaries.
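The dictionaries are plain text files, so a quick look at one of them (the path follows the layout described above) shows where the generated values come from:

```bash
# Print the first few entries of the city dictionary
head -n 5 "$BIG_BENCH_HOME/data-generator/dicts/city.dict"
```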

extlib contains the external jar dependencies: lucene-core-4.9.0.jar, commons-net-3.3.jar, xml-apis.jar and log4j-1.2.15.jar.

Summary

pdgf.jar reads the configuration in bigbench-generation.xml and bigbench-schema.xml (table names, column names, row counts, and the generation rule for each column), draws the value of each column of each record from the corresponding .dict files under dicts, and produces the raw data.

A record from the customer table looks like this:

0 AAAAAAAAAAAAAAAA 1824793 3203 2555 28776 14690 Ms. Marisa Harrington N 17 4 1988 UNITED ARAB EMIRATES RRCyuY3XfE3a Marisa.Harrington@lawyer.com   gdMmGdU9

If the TPCx-BB test is run with -f 1 (SCALE_FACTOR = 1), the total size of the generated raw data is about 1 GB (977M + 8.6M). In the listing below, the first column is the file size and the second is the space consumed across all HDFS replicas:

[root@node-20-100 ~]# hdfs dfs -du -h /user/root/benchmarks/bigbench/data
12.7 M 38.0 M /user/root/benchmarks/bigbench/data/customer
5.1 M 15.4 M /user/root/benchmarks/bigbench/data/customer_address
74.2 M 222.5 M /user/root/benchmarks/bigbench/data/customer_demographics
14.7 M 44.0 M /user/root/benchmarks/bigbench/data/date_dim
151.5 K 454.4 K /user/root/benchmarks/bigbench/data/household_demographics
327 981 /user/root/benchmarks/bigbench/data/income_band
405.3 M 1.2 G /user/root/benchmarks/bigbench/data/inventory
6.5 M 19.5 M /user/root/benchmarks/bigbench/data/item
4.0 M 12.0 M /user/root/benchmarks/bigbench/data/item_marketprices
53.7 M 161.2 M /user/root/benchmarks/bigbench/data/product_reviews
45.3 K 135.9 K /user/root/benchmarks/bigbench/data/promotion
3.0 K 9.1 K /user/root/benchmarks/bigbench/data/reason
1.2 K 3.6 K /user/root/benchmarks/bigbench/data/ship_mode
3.3 K 9.9 K /user/root/benchmarks/bigbench/data/store
4.1 M 12.4 M /user/root/benchmarks/bigbench/data/store_returns
88.5 M 265.4 M /user/root/benchmarks/bigbench/data/store_sales
4.9 M 14.6 M /user/root/benchmarks/bigbench/data/time_dim
584 1.7 K /user/root/benchmarks/bigbench/data/warehouse
170.4 M 511.3 M /user/root/benchmarks/bigbench/data/web_clickstreams
7.9 K 23.6 K /user/root/benchmarks/bigbench/data/web_page
5.1 M 15.4 M /user/root/benchmarks/bigbench/data/web_returns
127.6 M 382.8 M /user/root/benchmarks/bigbench/data/web_sales
8.6 K 25.9 K /user/root/benchmarks/bigbench/data/web_site
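The total can be double-checked by summing the first column of the byte-precise hdfs dfs -du output (a quick sketch):

```bash
# Sum the logical sizes (first column, in bytes) of all generated tables
hdfs dfs -du /user/root/benchmarks/bigbench/data \
  | awk '{sum += $1} END {printf "total: %.1f MB\n", sum / 1024 / 1024}'
```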

Execution Flow

To run the TPCx-BB test, change into the TPCx-BB source directory, enter the bin directory, and execute:

./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5

Here -f, -m, -s and -j are options the user can tune according to the cluster's capacity and their own needs. If they are omitted, defaults apply; the defaults are set in userSettings.conf under the conf directory:

export BIG_BENCH_DEFAULT_DATABASE="bigbench"
export BIG_BENCH_DEFAULT_ENGINE="hive"
export BIG_BENCH_DEFAULT_MAP_TASKS="80"
export BIG_BENCH_DEFAULT_SCALE_FACTOR="1000"
export BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS="2"
export BIG_BENCH_DEFAULT_BENCHMARK_PHASE="run_query"

So the defaults are MAP_TASKS = 80 (-m 80), SCALE_FACTOR = 1000 (-f 1000) and NUMBER_OF_PARALLEL_STREAMS = 2 (-s 2).
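Internally, each setting is resolved by letting the user's option override the exported default. A minimal sketch of that pattern (the variable names match the exports above, but the exact lines in bigBench may differ):

```bash
# Resolve effective settings: prefer the USER_* value set by getopts,
# fall back to the exported default otherwise
SCALE_FACTOR="${USER_SCALE_FACTOR:-$BIG_BENCH_DEFAULT_SCALE_FACTOR}"
MAP_TASKS="${USER_MAP_TASKS:-$BIG_BENCH_DEFAULT_MAP_TASKS}"
NUMBER_OF_PARALLEL_STREAMS="${USER_NUMBER_OF_PARALLEL_STREAMS:-$BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS}"
```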

All available options and their meanings:

General options:
-d  database to use (default: $BIG_BENCH_DEFAULT_DATABASE -> bigbench)
-e  engine to use (default: $BIG_BENCH_DEFAULT_ENGINE -> hive)
-f  scale factor of the data set (default: $BIG_BENCH_DEFAULT_SCALE_FACTOR -> 1000)
-h  show this help
-m  number of map tasks for data generation (default: $BIG_BENCH_DEFAULT_MAP_TASKS)
-s  number of parallel streams (default: $BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS -> 2)

Driver specific options:
-a  run in pretend mode
-b  print the bash scripts invoked during execution to stdout
-i  specify the phases to run (see $BIG_BENCH_CONF_DIR/bigBench.properties for details)
-j  specify the queries to run (default: all 30 queries, 1-30)
-U  unlock expert mode
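For example, a run that overrides several of these options might look like the following (hypothetical values, using the single-query -j form from the command at the top of this section):

```bash
# Generate a 10 GB data set with 16 map tasks and 2 parallel streams,
# and run only query 5 on the hive engine
./bigBench runBenchmark -e hive -f 10 -m 16 -s 2 -j 5
```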

If -U is given, expert mode is unlocked:

echo "EXPERT MODE ACTIVE"
echo "WARNING - INTERNAL USE ONLY:"
echo "Only set manually if you know what you are doing!"
echo "Ignoring them is probably the best solution"
echo "Running individual modules:"
echo "Usage: `basename $0` module [options]" -D 指定需要debug的查询部分. 大部分查询都只有一个单独的部分
-p 需要执行的benchmark phase (默认: $BIG_BENCH_DEFAULT_BENCHMARK_PHASE -> run_query)"
-q 指定需要执行哪个查询(只能指定一个)
-t 指定执行该查询时用第哪个stream
-v metastore population的sql脚本 (默认: ${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"})"
-w metastore refresh的sql脚本 (默认: ${USER_REFRESH_FILE:-"$BIG_BENCH_REFRESH_DIR/hiveRefreshCreateLoad.sql"})"
-y 含额外的用户自定义查询参数的文件 (global: $BIG_BENCH_ENGINE_CONF_DIR/queryParameters.sql)"
-z 含额外的用户自定义引擎设置的文件 (global: $BIG_BENCH_ENGINE_CONF_DIR/engineSettings.sql)" List of available modules:
$BIG_BENCH_ENGINE_BIN_DIR

Back to the command that launches the TPCx-BB test:

./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5

bigBench

bigBench is the main script; runBenchmark is a module.

bigBench sets many environment variables (paths, engine, number of streams, and so on), because RunBigBench.jar later reads these environment variables from within the Java program.

The earlier parts of bigBench do the groundwork: setting environment variables, parsing the user's options, granting file permissions, setting paths, and so on. The final step calls runBenchmark's runModule() function. Step by step:

    Set the basic paths

    export BIG_BENCH_VERSION="1.0"
    export BIG_BENCH_BIN_DIR="$BIG_BENCH_HOME/bin"
    export BIG_BENCH_CONF_DIR="$BIG_BENCH_HOME/conf"
    export BIG_BENCH_DATA_GENERATOR_DIR="$BIG_BENCH_HOME/data-generator"
    export BIG_BENCH_TOOLS_DIR="$BIG_BENCH_HOME/tools"
    export BIG_BENCH_LOGS_DIR="$BIG_BENCH_HOME/logs"

    Specify the paths to core-site.xml and hdfs-site.xml

    Data generation uses the Hadoop cluster; the data is generated on HDFS.

    export BIG_BENCH_DATAGEN_CORE_SITE="$BIG_BENCH_HADOOP_CONF/core-site.xml"
    export BIG_BENCH_DATAGEN_HDFS_SITE="$BIG_BENCH_HADOOP_CONF/hdfs-site.xml"

    Grant execute permission (755) to every executable file in the package (.sh/.jar/.py)

    find "$BIG_BENCH_HOME" -name '*.sh' -exec chmod 755 {} +
    find "$BIG_BENCH_HOME" -name '*.jar' -exec chmod 755 {} +
    find "$BIG_BENCH_HOME" -name '*.py' -exec chmod 755 {} +

    Set the path to userSettings.conf and source it

    USER_SETTINGS="$BIG_BENCH_CONF_DIR/userSettings.conf"
    if [ ! -f "$USER_SETTINGS" ]
    then
    echo "User settings file $USER_SETTINGS not found"
    exit 1
    else
    source "$USER_SETTINGS"
    fi

    Parse the input arguments and options and configure accordingly

    The first argument must be the module name.

    If there are no arguments, or the first argument starts with "-", the user did not supply a module to run:

    if [[ $# -eq 0 || "`echo "$1" | cut -c1`" = "-" ]]
    then
        export MODULE_NAME=""
        SHOW_HELP="1"
    else
        export MODULE_NAME="$1"
        shift
    fi
    export LIST_OF_USER_OPTIONS="$@"

Finally it parses the user's options and sets environment variables according to what was passed in:

```bash
while getopts ":d:D:e:f:hm:p:q:s:t:Uv:w:y:z:abi:j:" OPT; do
  case "$OPT" in
    # script options
    d)
      #echo "-d was triggered, Parameter: $OPTARG" >&2
      USER_DATABASE="$OPTARG"
      ;;
    D)
      #echo "-D was triggered, Parameter: $OPTARG" >&2
      DEBUG_QUERY_PART="$OPTARG"
      ;;
    e)
      #echo "-e was triggered, Parameter: $OPTARG" >&2
      USER_ENGINE="$OPTARG"
      ;;
    f)
      #echo "-f was triggered, Parameter: $OPTARG" >&2
      USER_SCALE_FACTOR="$OPTARG"
      ;;
    h)
      #echo "-h was triggered, Parameter: $OPTARG" >&2
      SHOW_HELP="1"
      ;;
    m)
      #echo "-m was triggered, Parameter: $OPTARG" >&2
      USER_MAP_TASKS="$OPTARG"
      ;;
    p)
      #echo "-p was triggered, Parameter: $OPTARG" >&2
      USER_BENCHMARK_PHASE="$OPTARG"
      ;;
    q)
      #echo "-q was triggered, Parameter: $OPTARG" >&2
      QUERY_NUMBER="$OPTARG"
      ;;
    s)
      #echo "-s was triggered, Parameter: $OPTARG" >&2
      USER_NUMBER_OF_PARALLEL_STREAMS="$OPTARG"
      ;;
    t)
      #echo "-t was triggered, Parameter: $OPTARG" >&2
      USER_STREAM_NUMBER="$OPTARG"
      ;;
    U)
      #echo "-U was triggered" >&2
      USER_EXPERT_MODE="1"
      ;;
    v)
      #echo "-v was triggered, Parameter: $OPTARG" >&2
      USER_POPULATE_FILE="$OPTARG"
      ;;
    w)
      #echo "-w was triggered, Parameter: $OPTARG" >&2
      USER_REFRESH_FILE="$OPTARG"
      ;;
    y)
      #echo "-y was triggered, Parameter: $OPTARG" >&2
      USER_QUERY_PARAMS_FILE="$OPTARG"
      ;;
    z)
      #echo "-z was triggered, Parameter: $OPTARG" >&2
      USER_ENGINE_SETTINGS_FILE="$OPTARG"
      ;;
    # driver options
    a)
      #echo "-a was triggered, Parameter: $OPTARG" >&2
      export USER_PRETEND_MODE="1"
      ;;
    b)
      #echo "-b was triggered, Parameter: $OPTARG" >&2
      export USER_PRINT_STD_OUT="1"
      ;;
    i)
      #echo "-i was triggered, Parameter: $OPTARG" >&2
      export USER_DRIVER_WORKLOAD="$OPTARG"
      ;;
    j)
      #echo "-j was triggered, Parameter: $OPTARG" >&2
      export USER_DRIVER_QUERIES_TO_RUN="$OPTARG"
      ;;
    ?)
      echo "Invalid option: -$OPTARG" >&2
      exit 1
      ;;
  esac
done
```

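With the USER_* variables exported, a convenient first experiment is a pretend-mode run, combining the -a and -b driver options documented above to inspect what the driver would execute without actually running it:

```bash
# Pretend-mode run (sketch): show what would be executed for query 5
# at scale factor 1, printing the underlying bash scripts to stdout
./bigBench runBenchmark -a -b -f 1 -j 5
```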