Understanding the TPCx-BB (Big Data Benchmark) Source Code in One Article

2023-07-21

TPCx-BB is a big data benchmark. It simulates 30 application scenarios of a retailer and executes 30 queries to measure the performance, hardware and software included, of a Hadoop-based big data system. Some of the scenarios also use machine learning algorithms (clustering, linear regression, and so on). To properly understand the performance of the system under test, you need a thorough understanding of the whole TPCx-BB test flow. This article walks through the source code of the TPCx-BB test suite in detail and should help you understand how TPCx-BB works.

Code Structure

The main directory ($BENCH_MARK_HOME) contains the following subdirectories:

bin
conf
data-generator
engines
tools


bin contains several modules, the scripts used during execution: bigBench, cleanLogs, logEnvInformation, runBenchmark, zipLogs, and so on.

conf contains two configuration files: bigBench.properties and userSettings.conf.

bigBench.properties mainly sets the workload (the benchmark phases to execute) and power_test_0 (the SQL queries to execute during the POWER_TEST phase).

The default workload is:

workload=CLEAN_ALL,ENGINE_VALIDATION_DATA_GENERATION,ENGINE_VALIDATION_LOAD_TEST,ENGINE_VALIDATION_POWER_TEST,ENGINE_VALIDATION_RESULT_VALIDATION,CLEAN_DATA,DATA_GENERATION,BENCHMARK_START,LOAD_TEST,POWER_TEST,THROUGHPUT_TEST_1,BENCHMARK_STOP,VALIDATE_POWER_TEST,VALIDATE_THROUGHPUT_TEST_1

The default power_test_0 is 1-30, i.e. all 30 queries.
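As a concrete illustration, the sketch below rewrites bigBench.properties with a reduced workload for a quick smoke test. This is a hypothetical example: the phase names are taken from the default workload above, and power_test_0 is assumed to accept the same query-list syntax as the -j option; back up the original file before trying it.

```bash
# Hypothetical sketch: shrink the workload to a quick smoke test.
# Phase names are copied from the default workload shown above.
cat > "$BIG_BENCH_HOME/conf/bigBench.properties" <<'EOF'
workload=CLEAN_ALL,DATA_GENERATION,BENCHMARK_START,LOAD_TEST,POWER_TEST,BENCHMARK_STOP
power_test_0=1,7,30
EOF
```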

userSettings.conf holds the basic settings, including the Java environment, the benchmark defaults (database, engine, map_tasks, scale_factor, ...), the Hadoop environment, HDFS config and paths, and the Hadoop data generation options (DFS_REPLICATION, HADOOP_JVM_ENV, ...).

data-generator contains the scripts and configuration files related to data generation; the details are covered in the next section.

engines contains the four engines supported by TPCx-BB: biginsights, hive, impala and spark_sql. The default engine is hive. In fact, only the hive directory is non-empty; the other three are empty, presumably because they have not been completed yet.

tools contains two jar files: HadoopClusterExec.jar and RunBigBench.jar. RunBigBench.jar is a central piece of the TPCx-BB test; most of the program logic lives in this jar.

Data Generation

The programs and configuration for data generation live in the data-generator directory, which contains pdgf.jar and three subdirectories: config, dicts and extlib.

pdgf.jar is the Java program that generates the data; it is a large codebase. config contains two configuration files: bigbench-generation.xml and bigbench-schema.xml.

bigbench-generation.xml mainly configures the raw data to be generated (not the database tables): which tables are produced, each table's name and size, the output directory, the file suffix, the delimiter, the character encoding, and so on.

<schema name="default">
<tables>
<!-- not refreshed tables --> <!-- tables not used in benchmark, but some tables have references to them. not refreshed. Kept for legacy reasons -->
<table name="income_band"></table>
<table name="reason"></table>
<table name="ship_mode"></table>
<table name="web_site"></table>
<!-- /tables not used in benchmark --> <!-- Static tables (fixed small size, generated only on node 1, skipped on others, not generated during refresh) -->
<table name="date_dim" static="true"></table>
<table name="time_dim" static="true"></table>
<table name="customer_demographics" static="true"></table>
<table name="household_demographics" static="true"></table>
<!-- /static tables --> <!-- "normal" tables. split over all nodes. not generated during refresh -->
<table name="store"></table>
<table name="warehouse"></table>
<table name="promotion"></table>
<table name="web_page"></table>
<!-- /"normal" tables.--> <!-- /not refreshed tables --> <!--
refreshed tables. Generated on all nodes.
Refresh tables generate extra data during refresh (e.g. add new data to the existing tables)
In "normal"-Phase generate table rows: [0,REFRESH_PERCENTAGE*Table.Size];
In "refresh"-Phase generate table rows: [REFRESH_PERCENTAGE*Table.Size+1, Table.Size]
.Has effect only if ${REFRESH_SYSTEM_ENABLED}==1.
-->
<table name="customer">
<scheduler name="DefaultScheduler">
<partitioner
name="pdgf.core.dataGenerator.scheduler.TemplatePartitioner">
<prePartition><![CDATA[
if(${REFRESH_SYSTEM_ENABLED}>0){
int tableID = table.getTableID();
int timeID = 0;
long lastTableRow=table.getSize()-1;
long rowStart;
long rowStop;
boolean exclude=false;
long refreshRows=table.getSize()*(1.0-${REFRESH_PERCENTAGE});
if(${REFRESH_PHASE}>0){
//Refresh part
rowStart = lastTableRow - refreshRows +1;
rowStop = lastTableRow;
if(refreshRows<=0){
exclude=true;
}
}else{
//"normal" part
rowStart = 0;
rowStop = lastTableRow - refreshRows;
}
return new pdgf.core.dataGenerator.scheduler.Partition(tableID, timeID,rowStart,rowStop,exclude);
}else{
//DEFAULT
return getParentPartitioner().getDefaultPrePartition(project, table);
} ]]></prePartition>
</partitioner>
</scheduler>
</table>
<output name="SplitFileOutputWrapper">
<!-- DEFAULT output for all Tables, if no table specific output is specified-->
<output name="CSVRowOutput">
<fileTemplate><![CDATA[outputDir + table.getName() +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
<outputDir>output/</outputDir>
<fileEnding>.dat</fileEnding>
<delimiter>|</delimiter>
<charset>UTF-8</charset>
<sortByRowID>true</sortByRowID>
</output>
<output name="StatisticsOutput" active="1">
<size>${item_size}</size><!-- a counter per item .. initialize later-->
<fileTemplate><![CDATA[outputDir + table.getName()+"_audit" +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
<outputDir>output/</outputDir>
<fileEnding>.csv</fileEnding>
<delimiter>,</delimiter>
<header><!--"" + pdgf.util.Constants.DEFAULT_LINESEPARATOR-->
</header>
<footer></footer>
</output>
</output>
</tables>
</schema>

bigbench-schema.xml defines a large number of parameters. A few concern table scale, such as each table's size (number of rows); most concern table columns, such as start and end dates, the gender ratio, the married ratio, and upper and lower bounds for various metrics. It also defines exactly how each column is generated and which constraints apply. Excerpts are shown below.

The size of the generated data is determined by SCALE_FACTOR (-f). With -f 1 the total generated data is roughly 1 GB; with -f 100, roughly 100 GB. So how does SCALE_FACTOR (-f) control the generated data size so precisely?

The reason is that SCALE_FACTOR (-f) determines the row count of every table. As shown below, the customer table has 100000.0d * ${SF_sqrt} rows: with -f 1 that is 100000 * sqrt(1) = 100,000 rows; with -f 100 it is 100000 * sqrt(100) = 1,000,000 rows.

<property name="${customer_size}" type="long">100000.0d * ${SF_sqrt}</property>
<property name="${DIMENSION_TABLES_START_DAY}" type="datetime">2000-01-03 00:00:00</property>
<property name="${DIMENSION_TABLES_END_DAY}" type="datetime">2004-01-05 00:00:00</property>
<property name="${gender_likelihood}" type="double">0.5</property>
<property name="${married_likelihood}" type="double">0.3</property>
<property name="${WP_LINK_MIN}" type="double">2</property>
<property name="${WP_LINK_MAX}" type="double">25</property>
  <field name="d_date" size="13" type="CHAR" primary="false">
<gen_DateTime>
<disableRng>true</disableRng>
<useFixedStepSize>true</useFixedStepSize>
<startDate>${date_dim_begin_date}</startDate>
<endDate>${date_dim_end_date}</endDate>
<outputFormat>yyyy-MM-dd</outputFormat>
</gen_DateTime>
</field>
  <field name="t_time_id" size="16" type="CHAR" primary="false">
<gen_ConvertNumberToString>
<gen_Id/>
<size>16.0</size>
<characters>ABCDEFGHIJKLMNOPQRSTUVWXYZ</characters>
</gen_ConvertNumberToString>
</field>
<field name="cd_dep_employed_count" size="10" type="INTEGER" primary="false">
<gen_Null probability="${NULL_CHANCE}">
<gen_WeightedListItem filename="dicts/bigbench/ds-genProbabilities.txt" list="dependent_count" valueColumn="0" weightColumn="0" />
</gen_Null>
</field>

dicts contains dictionary files such as city.dict, country.dict, male.dict, female.dict, state.dict and mail_provider.dict; the column values of each generated record are drawn from these dictionaries.
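The dictionaries are plain text files, so a quick look at one of them (the path follows the layout described above) shows where the generated values come from:

```bash
# Print the first few entries of the city dictionary
head -n 5 "$BIG_BENCH_HOME/data-generator/dicts/city.dict"
```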

extlib contains the external jar dependencies: lucene-core-4.9.0.jar, commons-net-3.3.jar, xml-apis.jar and log4j-1.2.15.jar.

Summary

pdgf.jar reads the configuration in bigbench-generation.xml and bigbench-schema.xml (table names, column names, row counts, and the generation rule for each column), draws the value of each column of each record from the corresponding .dict files under dicts, and produces the raw data.

A record from the customer table looks like this:

0 AAAAAAAAAAAAAAAA 1824793 3203 2555 28776 14690 Ms. Marisa Harrington N 17 4 1988 UNITED ARAB EMIRATES RRCyuY3XfE3a Marisa.Harrington@lawyer.com   gdMmGdU9

If the TPCx-BB test is run with -f 1 (SCALE_FACTOR = 1), the total size of the generated raw data is about 1 GB (977M + 8.6M). In the listing below, the first column is the file size and the second is the space consumed across all HDFS replicas:

[root@node-20-100 ~]# hdfs dfs -du -h /user/root/benchmarks/bigbench/data
12.7 M 38.0 M /user/root/benchmarks/bigbench/data/customer
5.1 M 15.4 M /user/root/benchmarks/bigbench/data/customer_address
74.2 M 222.5 M /user/root/benchmarks/bigbench/data/customer_demographics
14.7 M 44.0 M /user/root/benchmarks/bigbench/data/date_dim
151.5 K 454.4 K /user/root/benchmarks/bigbench/data/household_demographics
327 981 /user/root/benchmarks/bigbench/data/income_band
405.3 M 1.2 G /user/root/benchmarks/bigbench/data/inventory
6.5 M 19.5 M /user/root/benchmarks/bigbench/data/item
4.0 M 12.0 M /user/root/benchmarks/bigbench/data/item_marketprices
53.7 M 161.2 M /user/root/benchmarks/bigbench/data/product_reviews
45.3 K 135.9 K /user/root/benchmarks/bigbench/data/promotion
3.0 K 9.1 K /user/root/benchmarks/bigbench/data/reason
1.2 K 3.6 K /user/root/benchmarks/bigbench/data/ship_mode
3.3 K 9.9 K /user/root/benchmarks/bigbench/data/store
4.1 M 12.4 M /user/root/benchmarks/bigbench/data/store_returns
88.5 M 265.4 M /user/root/benchmarks/bigbench/data/store_sales
4.9 M 14.6 M /user/root/benchmarks/bigbench/data/time_dim
584 1.7 K /user/root/benchmarks/bigbench/data/warehouse
170.4 M 511.3 M /user/root/benchmarks/bigbench/data/web_clickstreams
7.9 K 23.6 K /user/root/benchmarks/bigbench/data/web_page
5.1 M 15.4 M /user/root/benchmarks/bigbench/data/web_returns
127.6 M 382.8 M /user/root/benchmarks/bigbench/data/web_sales
8.6 K 25.9 K /user/root/benchmarks/bigbench/data/web_site
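The total can be double-checked by summing the first column of the byte-precise hdfs dfs -du output (a quick sketch):

```bash
# Sum the logical sizes (first column, in bytes) of all generated tables
hdfs dfs -du /user/root/benchmarks/bigbench/data \
  | awk '{sum += $1} END {printf "total: %.1f MB\n", sum / 1024 / 1024}'
```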

Execution Flow

To run the TPCx-BB test, change into the TPCx-BB source directory, enter the bin directory, and execute:

./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5

Here -f, -m, -s and -j are options the user can tune according to the cluster's capacity and their own needs. If they are omitted, defaults apply; the defaults are set in userSettings.conf under the conf directory:

export BIG_BENCH_DEFAULT_DATABASE="bigbench"
export BIG_BENCH_DEFAULT_ENGINE="hive"
export BIG_BENCH_DEFAULT_MAP_TASKS="80"
export BIG_BENCH_DEFAULT_SCALE_FACTOR="1000"
export BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS="2"
export BIG_BENCH_DEFAULT_BENCHMARK_PHASE="run_query"

So the defaults are MAP_TASKS = 80 (-m 80), SCALE_FACTOR = 1000 (-f 1000) and NUMBER_OF_PARALLEL_STREAMS = 2 (-s 2).
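Internally, each setting is resolved by letting the user's option override the exported default. A minimal sketch of that pattern (the variable names match the exports above, but the exact lines in bigBench may differ):

```bash
# Resolve effective settings: prefer the USER_* value set by getopts,
# fall back to the exported default otherwise
SCALE_FACTOR="${USER_SCALE_FACTOR:-$BIG_BENCH_DEFAULT_SCALE_FACTOR}"
MAP_TASKS="${USER_MAP_TASKS:-$BIG_BENCH_DEFAULT_MAP_TASKS}"
NUMBER_OF_PARALLEL_STREAMS="${USER_NUMBER_OF_PARALLEL_STREAMS:-$BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS}"
```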

All available options and their meanings:

General options:
-d  database to use (default: $BIG_BENCH_DEFAULT_DATABASE -> bigbench)
-e  engine to use (default: $BIG_BENCH_DEFAULT_ENGINE -> hive)
-f  scale factor of the data set (default: $BIG_BENCH_DEFAULT_SCALE_FACTOR -> 1000)
-h  show this help
-m  number of map tasks for data generation (default: $BIG_BENCH_DEFAULT_MAP_TASKS)
-s  number of parallel streams (default: $BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS -> 2)

Driver specific options:
-a  run in pretend mode
-b  print the bash scripts invoked during execution to stdout
-i  specify the phases to run (see $BIG_BENCH_CONF_DIR/bigBench.properties for details)
-j  specify the queries to run (default: all 30 queries, 1-30)
-U  unlock expert mode
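For example, a run that overrides several of these options might look like the following (hypothetical values, using the single-query -j form from the command at the top of this section):

```bash
# Generate a 10 GB data set with 16 map tasks and 2 parallel streams,
# and run only query 5 on the hive engine
./bigBench runBenchmark -e hive -f 10 -m 16 -s 2 -j 5
```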

If -U is given, expert mode is unlocked:

echo "EXPERT MODE ACTIVE"
echo "WARNING - INTERNAL USE ONLY:"
echo "Only set manually if you know what you are doing!"
echo "Ignoring them is probably the best solution"
echo "Running individual modules:"
echo "Usage: `basename $0` module [options]" -D 指定需要debug的查询部分. 大部分查询都只有一个单独的部分
-p 需要执行的benchmark phase (默认: $BIG_BENCH_DEFAULT_BENCHMARK_PHASE -> run_query)"
-q 指定需要执行哪个查询(只能指定一个)
-t 指定执行该查询时用第哪个stream
-v metastore population的sql脚本 (默认: ${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"})"
-w metastore refresh的sql脚本 (默认: ${USER_REFRESH_FILE:-"$BIG_BENCH_REFRESH_DIR/hiveRefreshCreateLoad.sql"})"
-y 含额外的用户自定义查询参数的文件 (global: $BIG_BENCH_ENGINE_CONF_DIR/queryParameters.sql)"
-z 含额外的用户自定义引擎设置的文件 (global: $BIG_BENCH_ENGINE_CONF_DIR/engineSettings.sql)" List of available modules:
$BIG_BENCH_ENGINE_BIN_DIR

Back to the command that launches the TPCx-BB test:

./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5

bigBench

bigBench is the main script; runBenchmark is a module.

bigBench sets many environment variables (paths, engine, number of streams, and so on), because RunBigBench.jar later reads these environment variables from within the Java program.

The earlier parts of bigBench do the groundwork: setting environment variables, parsing the user's options, granting file permissions, setting paths, and so on. The final step calls runBenchmark's runModule() function. Step by step:

    Set the basic paths

    export BIG_BENCH_VERSION="1.0"
    export BIG_BENCH_BIN_DIR="$BIG_BENCH_HOME/bin"
    export BIG_BENCH_CONF_DIR="$BIG_BENCH_HOME/conf"
    export BIG_BENCH_DATA_GENERATOR_DIR="$BIG_BENCH_HOME/data-generator"
    export BIG_BENCH_TOOLS_DIR="$BIG_BENCH_HOME/tools"
    export BIG_BENCH_LOGS_DIR="$BIG_BENCH_HOME/logs"

    Specify the paths to core-site.xml and hdfs-site.xml

    Data generation uses the Hadoop cluster; the data is generated on HDFS.

    export BIG_BENCH_DATAGEN_CORE_SITE="$BIG_BENCH_HADOOP_CONF/core-site.xml"
    export BIG_BENCH_DATAGEN_HDFS_SITE="$BIG_BENCH_HADOOP_CONF/hdfs-site.xml"

    Grant execute permission (755) to every executable file in the package (.sh/.jar/.py)

    find "$BIG_BENCH_HOME" -name '*.sh' -exec chmod 755 {} +
    find "$BIG_BENCH_HOME" -name '*.jar' -exec chmod 755 {} +
    find "$BIG_BENCH_HOME" -name '*.py' -exec chmod 755 {} +

    Set the path to userSettings.conf and source it

    USER_SETTINGS="$BIG_BENCH_CONF_DIR/userSettings.conf"
    if [ ! -f "$USER_SETTINGS" ]
    then
    echo "User settings file $USER_SETTINGS not found"
    exit 1
    else
    source "$USER_SETTINGS"
    fi

    Parse the input arguments and options and configure accordingly

    The first argument must be the module name.

    If there are no arguments, or the first argument starts with "-", the user did not supply a module to run:

    if [[ $# -eq 0 || "`echo "$1" | cut -c1`" = "-" ]]
    then
        export MODULE_NAME=""
        SHOW_HELP="1"
    else
        export MODULE_NAME="$1"
        shift
    fi
    export LIST_OF_USER_OPTIONS="$@"

Finally it parses the user's options and sets environment variables according to what was passed in:

```bash
while getopts ":d:D:e:f:hm:p:q:s:t:Uv:w:y:z:abi:j:" OPT; do
  case "$OPT" in
    # script options
    d)
      #echo "-d was triggered, Parameter: $OPTARG" >&2
      USER_DATABASE="$OPTARG"
      ;;
    D)
      #echo "-D was triggered, Parameter: $OPTARG" >&2
      DEBUG_QUERY_PART="$OPTARG"
      ;;
    e)
      #echo "-e was triggered, Parameter: $OPTARG" >&2
      USER_ENGINE="$OPTARG"
      ;;
    f)
      #echo "-f was triggered, Parameter: $OPTARG" >&2
      USER_SCALE_FACTOR="$OPTARG"
      ;;
    h)
      #echo "-h was triggered, Parameter: $OPTARG" >&2
      SHOW_HELP="1"
      ;;
    m)
      #echo "-m was triggered, Parameter: $OPTARG" >&2
      USER_MAP_TASKS="$OPTARG"
      ;;
    p)
      #echo "-p was triggered, Parameter: $OPTARG" >&2
      USER_BENCHMARK_PHASE="$OPTARG"
      ;;
    q)
      #echo "-q was triggered, Parameter: $OPTARG" >&2
      QUERY_NUMBER="$OPTARG"
      ;;
    s)
      #echo "-s was triggered, Parameter: $OPTARG" >&2
      USER_NUMBER_OF_PARALLEL_STREAMS="$OPTARG"
      ;;
    t)
      #echo "-t was triggered, Parameter: $OPTARG" >&2
      USER_STREAM_NUMBER="$OPTARG"
      ;;
    U)
      #echo "-U was triggered" >&2
      USER_EXPERT_MODE="1"
      ;;
    v)
      #echo "-v was triggered, Parameter: $OPTARG" >&2
      USER_POPULATE_FILE="$OPTARG"
      ;;
    w)
      #echo "-w was triggered, Parameter: $OPTARG" >&2
      USER_REFRESH_FILE="$OPTARG"
      ;;
    y)
      #echo "-y was triggered, Parameter: $OPTARG" >&2
      USER_QUERY_PARAMS_FILE="$OPTARG"
      ;;
    z)
      #echo "-z was triggered, Parameter: $OPTARG" >&2
      USER_ENGINE_SETTINGS_FILE="$OPTARG"
      ;;
    # driver options
    a)
      #echo "-a was triggered, Parameter: $OPTARG" >&2
      export USER_PRETEND_MODE="1"
      ;;
    b)
      #echo "-b was triggered, Parameter: $OPTARG" >&2
      export USER_PRINT_STD_OUT="1"
      ;;
    i)
      #echo "-i was triggered, Parameter: $OPTARG" >&2
      export USER_DRIVER_WORKLOAD="$OPTARG"
      ;;
    j)
      #echo "-j was triggered, Parameter: $OPTARG" >&2
      export USER_DRIVER_QUERIES_TO_RUN="$OPTARG"
      ;;
    ?)
      echo "Invalid option: -$OPTARG" >&2
      exit 1
      ;;
  esac
done
```

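With the USER_* variables exported, a convenient first experiment is a pretend-mode run, combining the -a and -b driver options documented above to inspect what the driver would execute without actually running it:

```bash
# Pretend-mode run (sketch): show what would be executed for query 5
# at scale factor 1, printing the underlying bash scripts to stdout
./bigBench runBenchmark -a -b -f 1 -j 5
```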