Apache Bigtop and the choice of Hadoop distribution today

It is probably no secret that last year brought great change to Apache Hadoop. Cloudera and Hortonworks merged (in effect, Cloudera absorbed Hortonworks), and MapR, facing serious financial problems, was sold to Hewlett Packard Enterprise. A few years ago, anyone planning an on-premises installation usually chose between Cloudera and Hortonworks; today, alas, that choice is gone. Another surprise came in February of this year, when Cloudera announced that it would stop publishing binary builds of its distribution to a public repository, so they are now available only with a paid subscription. The versions of CDH and HDP released before the end of 2019 can still be downloaded, and support for them is expected to last one to two years. But what then? For those who already paid for a subscription, nothing has changed. This article is for those who do not want to move to the paid version of the distribution but still want the latest versions of cluster components, along with patches and other updates. In it, we look at possible ways out of this situation.

The options considered below are Arenadata Hadoop, vanilla Hadoop, and Apache Bigtop.

Arenadata Hadoop

This is a completely new and, so far, little-known domestically developed distribution. Unfortunately, at the moment there is only a single article about it on Habr.

More information can be found on the official website of the project. The latest releases are based on Hadoop 3.1.2 for version 3 and Hadoop 2.8.5 for version 2.

The project also publishes its roadmap.

Figure: the Arenadata Cluster Manager interface.

Arenadata's key product is Arenadata Cluster Manager (ADCM), which is used to install, configure and monitor the company's various software solutions. ADCM is free, and its functionality is extended with bundles, each of which is a set of Ansible playbooks. Bundles come in two types, enterprise and community; the community ones can be downloaded from Arenadata free of charge. You can also develop your own bundle and plug it into ADCM.

For deploying and managing Hadoop 3, Arenadata offers a community bundle used together with ADCM; for Hadoop 2, the only alternative is Apache Ambari. The package repositories are open to the public, so packages for all cluster components can be downloaded and installed in the usual way. Overall, the distribution looks very interesting. I am sure that some of those accustomed to Cloudera Manager and Ambari will like ADCM as well. For some, the fact that the distribution is listed in the import-substitution software registry will be a huge plus.

The drawbacks are the same as for any other Hadoop distribution, namely:

The so-called "vendor lock-in". The examples of Cloudera and Hortonworks have already shown that a company's policy can change at any time.
A significant lag behind the Apache upstream.

Vanilla Hadoop

As you know, Hadoop is not a monolithic product but a whole galaxy of services built around its distributed file system, HDFS. Few people need a bare file-storage cluster. Some need Hive, others Presto; there are also HBase and Phoenix, and Spark is used more and more. Oozie, Sqoop and Flume sometimes appear for orchestration and data loading. And when security comes up, Kerberos together with Ranger immediately come to mind.

Binary versions of Hadoop components are available as tarballs on the website of each ecosystem project. You can download them and start installing, but with one caveat: besides having to build packages from the "raw" binaries yourself, which you will most likely want to do, you will have no assurance that the downloaded component versions are compatible with each other. The preferred option is to build with Apache Bigtop. Bigtop lets you build from Apache's Maven repositories, run tests, and produce packages. And, most importantly for us, Bigtop builds component versions that are compatible with each other. We discuss it in more detail below.

Apache Bigtop

Apache Bigtop is a tool for building, packaging and testing a number of open source projects, such as Hadoop and Greenplum. Bigtop has had many releases. At the time of writing, the latest stable release was 1.4, and master was at 1.5. Different releases use different component versions. For example, in 1.4 the Hadoop core components are at version 2.8.5, and in master at 2.10.0. The set of supported components also changes: outdated and unmaintained components are dropped, and newer, more in-demand ones take their place, not necessarily from the Apache family itself.

Bigtop also has many forks.

When we first got acquainted with Bigtop, we were surprised above all by how modest its adoption and visibility are compared with other Apache projects, and by how small its community is. As a result, there is very little information about the product, and searching forums and mailing lists for solutions to problems may turn up nothing at all. At first, completing a build of the distribution proved difficult for us because of the peculiarities of the tool itself, but more on that a little later.

As a teaser: for those who once spent time with Linux-universe projects such as Gentoo and LFS, working with this thing may feel pleasantly nostalgic, recalling those "old" times when we searched for (or even wrote) ebuilds ourselves and regularly rebuilt Mozilla with new patches.

A big plus of Bigtop is the openness and versatility of the tools it is based on: Gradle and Apache Maven. Gradle is fairly well known as the tool Google uses to build Android. It is flexible and, as they say, "battle-tested". Maven is the standard build tool within Apache itself, and since most Apache products are released through Maven, Bigtop could not do without it. It is worth paying attention to the POM (project object model), the "fundamental" XML file that describes everything Maven needs to work with your project and around which all the work is built. It is in the Maven part that people picking up Bigtop for the first time usually hit some obstacles.

Practice

So where to start? We go to the download page and download the latest stable version as an archive. Binary artifacts collected by Bigtop can also be found there. By the way, of the common package managers, YUM and APT are supported.

Alternatively, you can clone the latest stable release branch directly from GitHub:

$ git clone --branch branch-1.4 https://github.com/apache/bigtop.git

Cloning into 'bigtop'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 40217 (delta 14), reused 10 (delta 1), pack-reused 40171
Receiving objects: 100% (40217/40217), 43.54 MiB | 1.05 MiB/s, done.
Resolving deltas: 100% (20503/20503), done.
Updating files: 100% (1998/1998), done.

The resulting ./bigtop directory looks something like this:

./bigtop-bigpetstore - demo applications, synthetic examples
./bigtop-ci - CI tools, Jenkins
./bigtop-data-generators - data generation (synthetic data for smoke tests, etc.)
./bigtop-deploy - deployment tools
./bigtop-packages - configs, scripts and patches for the build; the main part of the tool
./bigtop-test-framework - testing framework
./bigtop-tests - the tests themselves, stress and smoke
./bigtop_toolchain - prepares the environment the tool needs for building
./build - working directory of the build
./dl - directory for downloaded sources
./docker - building in Docker images, testing
./gradle - Gradle config
./output - directory where build artifacts are placed
./provisioner - provisioning

The most interesting thing for us at this stage is the main config, ./bigtop/bigtop.bom, which lists all supported components with their versions. This is where we can specify a different version of a product (if we suddenly want to try building it) or a different build version (if, for example, we have added a significant patch).
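To get a quick overview of what such a config pins, one can pull out name and version pairs with a small script. The sketch below runs against a hypothetical excerpt written to a temporary file; the block layout (a `name = '...'` line followed by a `version { base = '...' }` line), the component `mycomponent` and its version are assumptions made for illustration, so check them against the real bigtop.bom before pointing the script at it.

```shell
# Hypothetical excerpt; the layout of the real bigtop.bom may differ.
bom=$(mktemp)
cat > "$bom" <<'EOF'
components {
  'hadoop' {
    name    = 'hadoop'
    version { base = '2.8.5'; pkg = base; release = 1 }
  }
  'mycomponent' {
    name    = 'mycomponent'
    version { base = '0.0.0'; pkg = base; release = 1 }
  }
}
EOF

# Split each line on single quotes: field 2 is the first quoted value.
# Remember the last component name and print it when its version line appears.
versions=$(awk -F"'" '
  /name *=/      { name = $2 }
  /version *[{]/ { print name, $2 }
' "$bom")

echo "$versions"
rm -f "$bom"
```

Replacing the temporary file with the path to the actual config would be the obvious next step, provided the layout assumption holds.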

Also of great interest is the subdirectory ./bigtop/bigtop-packages, which is directly involved in building the components and their packages.

So, we have downloaded and unpacked the archive, or cloned the repository from GitHub. Can we start the build?

Not yet: first we need to prepare the environment.

Environment preparation

A small digression is needed here. Building almost any reasonably complex product requires a certain environment: in our case, the JDK, shared libraries, header files and so on, plus tools such as ant and ivy2, and much more. One way to get the environment Bigtop needs is to install the required components on the build host. I may be mistaken about the chronology, but it seems that since version 1.0 it has also been possible to build inside preconfigured, publicly available Docker images.

For preparing the environment there is a helper: Puppet.

You can use the following commands, run from the tool's root directory, ./bigtop:

./gradlew toolchain
./gradlew toolchain-devtools
./gradlew toolchain-puppetmodules

Or directly through puppet:

puppet apply --modulepath=<path_to_bigtop> -e "include bigtop_toolchain::installer"
puppet apply --modulepath=<path_to_bigtop> -e "include bigtop_toolchain::deployment-tools"
puppet apply --modulepath=<path_to_bigtop> -e "include bigtop_toolchain::development-tools"

Unfortunately, difficulties may arise already at this stage. The general advice is to use a supported, up-to-date distribution on the build host, or to go the Docker route.

Build

What can we try to build? The output of the following command answers that question:

./gradlew tasks

The Package tasks section lists the products that are Bigtop's final artifacts. They can be identified by the suffix -rpm or -pkg-ind (the latter for builds in Docker). In our case, the most interesting one is Hadoop.

Let's try to build in the environment of our build server:

./gradlew hadoop-rpm

Bigtop itself downloads the sources needed for a particular component and starts building. The tool therefore depends on Maven repositories and other sources, which means it needs Internet access.

The build writes its progress to standard output. Sometimes that output and the error messages are enough to understand what went wrong, and sometimes you need more information. In that case, add the --info or --debug argument; --stacktrace may also be useful. There is also a convenient way to produce a data set that you can then refer to on the mailing lists: the --scan option.

With it, Bigtop collects all the build information and uploads it to Gradle's build scan service, then returns a link that a knowledgeable person can use to work out why the build failed. Keep in mind that this option may make public information you would rather keep private, such as usernames, host names and environment variables, so be careful.
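Since verbose flags produce a lot of output and a build can run for tens of minutes, it may help to keep the full output in a file and show only the tail on screen. The helper below is a minimal sketch; `run_logged` is a hypothetical name, not part of Bigtop or Gradle, and the demonstration uses `echo` as a stand-in for a real build command.

```shell
# Hypothetical helper: run a command, save all of its output to a log file,
# print the last lines, and preserve the command's exit status.
run_logged() {
  log_file=$1
  shift
  "$@" > "$log_file" 2>&1
  status=$?
  # Show only the tail on screen; the complete output stays in the log file.
  tail -n 20 "$log_file"
  return "$status"
}

# Demonstration with a stand-in command instead of a real build.
demo_log=$(mktemp)
run_logged "$demo_log" echo "stand-in for a build command"
```

With a real build this could be invoked as `run_logged hadoop-rpm.log ./gradlew hadoop-rpm --info --stacktrace`, after which the log file can be searched locally before deciding whether anything needs to be shared.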

Errors are often caused by a failure to fetch some component needed for the build. As a rule, you can fix the problem with a patch that corrects something in the sources, for example an address in the pom.xml in the root directory of the source tree. To do this, create the patch and place it in the appropriate directory, for example ./bigtop/bigtop-packages/src/common/oozie/patch, under a name such as patch2-fix.diff.

--- a/pom.xml
+++ b/pom.xml
@@ -136,7 +136,7 @@
<repositories>
<repository>
<id>central</id>
- <url>http://repo1.maven.org/maven2</url>
+ <url>https://repo1.maven.org/maven2</url>
<snapshots>
<enabled>false</enabled>
</snapshots>

Most likely, by the time you read this article you will not have to make the above correction yourself.

After introducing patches or other edits into the build mechanism, you may need to "reset" the build with the clean task:
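A patch of this shape can be produced with the standard `diff -u` tool. The following is a minimal, self-contained sketch: it fabricates a tiny stand-in pom.xml in a temporary directory (not the real Oozie POM), edits a copy, and generates a unified diff with the same a/ and b/ path prefixes as above. How Bigtop picks up and applies the resulting file is not shown here.

```shell
work=$(mktemp -d)
mkdir -p "$work/a" "$work/b"

# Stand-in for the original file; a real pom.xml is much larger.
printf '%s\n' '<repository>' '<id>central</id>' \
  '<url>http://repo1.maven.org/maven2</url>' '</repository>' > "$work/a/pom.xml"

# The edited copy: switch the repository URL to HTTPS.
sed 's|http://repo1|https://repo1|' "$work/a/pom.xml" > "$work/b/pom.xml"

# diff exits with status 1 when the files differ, which is expected here.
(cd "$work" && diff -u a/pom.xml b/pom.xml > patch2-fix.diff) || true

patch_text=$(cat "$work/patch2-fix.diff")
echo "$patch_text"
rm -rf "$work"
```

Working on a copy of the source tree and diffing it against the pristine one keeps the original untouched and makes the patch easy to regenerate.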

./gradlew hadoop-clean
> Task :hadoop_vardefines
> Task :hadoop-clean
BUILD SUCCESSFUL in 5s
2 actionable tasks: 2 executed

This operation rolls back the build state of the component, so that the next build of it starts from scratch. This time we will try to build the project in a Docker image:

./gradlew -POS=centos-7 -Pprefix=1.2.1 hadoop-pkg-ind
> Task :hadoop-pkg-ind
Building 1.2.1 hadoop-pkg on centos-7 in Docker...
+++ dirname ./bigtop-ci/build.sh
++ cd ./bigtop-ci/..
++ pwd
+ BIGTOP_HOME=/tmp/bigtop
+ '[' 6 -eq 0 ']'
+ [[ 6 -gt 0 ]]
+ key=--prefix
+ case $key in
+ PREFIX=1.2.1
+ shift
+ shift
+ [[ 4 -gt 0 ]]
+ key=--os
+ case $key in
+ OS=centos-7
+ shift
+ shift
+ [[ 2 -gt 0 ]]
+ key=--target
+ case $key in
+ TARGET=hadoop-pkg
+ shift
+ shift
+ [[ 0 -gt 0 ]]
+ '[' -z x ']'
+ '[' -z x ']'
+ '[' '' == true ']'
+ IMAGE_NAME=bigtop/slaves:1.2.1-centos-7
++ uname -m
+ ARCH=x86_64
+ '[' x86_64 '!=' x86_64 ']'
++ docker run -d bigtop/slaves:1.2.1-centos-7 /sbin/init
+ CONTAINER_ID=0ce5ac5ca955b822a3e6c5eb3f477f0a152cd27d5487680f77e33fbe66b5bed8
+ trap 'docker rm -f 0ce5ac5ca955b822a3e6c5eb3f477f0a152cd27d5487680f77e33fbe66b5bed8' EXIT
....
 
....
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-yarn-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-mapreduce-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-namenode-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-secondarynamenode-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-zkfc-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-journalnode-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-datanode-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-httpfs-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-yarn-resourcemanager-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-yarn-nodemanager-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-yarn-proxyserver-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-yarn-timelineserver-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-mapreduce-historyserver-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-client-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-conf-pseudo-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-doc-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-libhdfs-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-libhdfs-devel-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-fuse-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-debuginfo-2.8.5-1.el7.x86_64.rpm
+ umask 022
+ cd /bigtop/build/hadoop/rpm//BUILD
+ cd hadoop-2.8.5-src
+ /usr/bin/rm -rf /bigtop/build/hadoop/rpm/BUILDROOT/hadoop-2.8.5-1.el7.x86_64
Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.uQ2FCn
+ exit 0
+ umask 022
Executing(--clean): /bin/sh -e /var/tmp/rpm-tmp.CwDb22
+ cd /bigtop/build/hadoop/rpm//BUILD
+ rm -rf hadoop-2.8.5-src
+ exit 0
[ant:touch] Creating /bigtop/build/hadoop/.rpm
:hadoop-rpm (Thread[Task worker for ':',5,main]) completed. Took 38 mins 1.151 secs.
:hadoop-pkg (Thread[Task worker for ':',5,main]) started.
> Task :hadoop-pkg
Task ':hadoop-pkg' is not up-to-date because:
Task has not declared any outputs despite executing actions.
:hadoop-pkg (Thread[Task worker for ':',5,main]) completed. Took 0.0 secs.
BUILD SUCCESSFUL in 40m 37s
6 actionable tasks: 6 executed
+ RESULT=0
+ mkdir -p output
+ docker cp ac46014fd9501bdc86b6c67d08789fbdc6ee46a2645550ff6b6712f7d02ffebb:/bigtop/build .
+ docker cp ac46014fd9501bdc86b6c67d08789fbdc6ee46a2645550ff6b6712f7d02ffebb:/bigtop/output .
+ docker rm -f ac46014fd9501bdc86b6c67d08789fbdc6ee46a2645550ff6b6712f7d02ffebb
ac46014fd9501bdc86b6c67d08789fbdc6ee46a2645550ff6b6712f7d02ffebb
+ '[' 0 -ne 0 ']'
+ docker rm -f ac46014fd9501bdc86b6c67d08789fbdc6ee46a2645550ff6b6712f7d02ffebb
Error: No such container: ac46014fd9501bdc86b6c67d08789fbdc6ee46a2645550ff6b6712f7d02ffebb
BUILD SUCCESSFUL in 41m 24s
1 actionable task: 1 executed

The build was done under CentOS, but you can also do it under Ubuntu:

./gradlew -POS=ubuntu-16.04 -Pprefix=1.2.1 hadoop-pkg-ind
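If packages are needed for both platforms, the two invocations can be wrapped in a loop. This is only a sketch: `build_all` and the `DRY_RUN` switch are hypothetical names introduced here, and by default the function just prints the commands, because each real build may take tens of minutes and needs Docker.

```shell
# Hypothetical wrapper around the two commands shown above.
# DRY_RUN=1 (the default) prints the commands; DRY_RUN=0 runs them.
build_all() {
  for os in centos-7 ubuntu-16.04; do
    set -- ./gradlew -POS="$os" -Pprefix=1.2.1 hadoop-pkg-ind
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "$*"
    else
      # Stop at the first failing platform.
      "$@" || return 1
    fi
  done
}

build_all
```

Run from the Bigtop root directory with `DRY_RUN=0 build_all` once the printed commands look right.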

In addition to assembling packages for various Linux distributions, the tool can create a repository with assembled packages, for example:

./gradlew yum

Smoke tests and deployment in Docker are also worth mentioning.

Create a cluster of three nodes:

./gradlew -Pnum_instances=3 docker-provisioner

Run smoke tests in a cluster of three nodes:

./gradlew -Pnum_instances=3 -Prun_smoke_tests docker-provisioner

Delete cluster:

./gradlew docker-provisioner-destroy

Get commands for connecting inside docker containers:

./gradlew docker-provisioner-ssh

Show status:

./gradlew docker-provisioner-status
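These tasks can be chained so that a test cluster never outlives its test run. The sketch below assumes that the smoke-test command shown above both creates the cluster and runs the tests; `provisioner` and `smoke_cycle` are hypothetical helper names, and by default the commands are only printed rather than executed.

```shell
# Hypothetical helper: print the Gradle command by default (DRY_RUN=1),
# or execute it when DRY_RUN=0.
provisioner() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo ./gradlew "$@"
  else
    ./gradlew "$@"
  fi
}

smoke_cycle() {
  provisioner -Pnum_instances=3 -Prun_smoke_tests docker-provisioner
  status=$?
  # Tear the cluster down even if the tests failed, then report the test result.
  provisioner docker-provisioner-destroy
  return "$status"
}

smoke_cycle
```

Returning the saved status after the teardown lets a CI job fail on broken tests without leaving containers behind.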

You can read more about Deployment tasks in the documentation.

As for tests, there are quite a lot of them, mainly smoke and integration tests. Analyzing them is beyond the scope of this article. I will only say that building the distribution is not as difficult as it might seem at first glance. We managed to build, and pass the tests for, all the components that we use in production, and we had no problems deploying them and performing basic operations in a test environment.

Besides the components already in Bigtop, you can add others, even software of your own. All of this automates well and fits into the CI/CD concept.

Conclusion

Obviously, a distribution built this way should not be sent straight to production. If you really need to build and maintain your own distribution, you have to be ready to invest both money and time.

Nevertheless, in combination with the right approach and a professional team, it is quite possible to do without commercial solutions.

It is important to note that the Bigtop project itself needs further development, and today it does not appear to be actively developed. It is also unclear when Hadoop 3 will appear in it. By the way, if you really need to build Hadoop 3, you can look at the fork from Arenadata, which, in addition to the standard components, includes a number of additional ones (Ranger, Knox, NiFi).

As for Rostelecom, for us Bigtop is one of the options under consideration today. Whether we settle on it or not, time will tell.

Appendix

To include a new component in the build, add its description to bigtop.bom and to ./bigtop-packages. You can try doing this by analogy with the existing components. It is not as difficult as it seems at first glance.

What do you think? We will be glad to see your opinion in the comments and thank you for your attention!

This article was prepared by the Rostelecom data management team
