Top 10 misconceptions about porting Hadoop to the cloud


Many companies want to use the cloud to process data for obvious reasons: flexibility, scalability, paying only for what you use, and so on.

In practice, moving a multi-component, petabyte-scale data processing system from an on-premises environment to the cloud comes with a big "but". There are many products to migrate: Hadoop, Hive, YARN, Spark, Kafka, ZooKeeper, Jupyter, Zeppelin. Given the fundamental differences between the environments, it is easy to get lost in this variety and make mistakes.

In this article I will go over common misconceptions and share some tips for a sound migration to the cloud. Personally, I use AWS, but all of the advice applies to other providers with similar offerings, such as Azure or GCP.

1. Copying data to the cloud is easy


Transferring several petabytes of data to the public cloud (for example, to S3, which in our case will serve as the data lake) is not an easy task. It can be very time-consuming and resource-intensive.

Despite the huge number of solutions, both commercial and open source, I did not find a single one that covers all the needs:

  • data transfer
  • data integration
  • data verification
  • reporting

If part of the data is mostly static or only moderately dynamic, you can use a solution like AWS Snowball, which lets you copy data sets onto a physical device. The data is loaded from your local network, the device is then shipped back to an AWS data center, and the data is uploaded into S3.

It is good practice to split the data transfer into two phases. After most of the data has been shipped and loaded into the storage, use a direct connection from the cloud provider to move the remainder. You can use Hadoop DistCp or Kafka mirroring for this. Both approaches have their nuances. DistCp requires constant scheduling and deep tuning, and not every object fits neatly into its include and exclude lists. Kafka MirrorMaker, in addition to deep tuning, needs to export metrics through the JMX management extension so you can measure throughput, latency, and overall stability.
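As a rough illustration of the DistCp route, here is a minimal sketch of driving an incremental copy from Python for the "remainder" phase. The bucket, paths, and filter file are hypothetical; the flags shown (-update, -m, -bandwidth, -filters) are standard DistCp options for incremental copies, map parallelism, per-map bandwidth, and exclusion patterns.

```python
import subprocess

# Hypothetical paths and bucket name; adjust to your environment.
SRC = "hdfs:///data/events"
DST = "s3a://my-datalake-bucket/events"
FILTERS = "/etc/hadoop/distcp-exclude.txt"  # regex patterns to skip, one per line

# -update copies only missing/changed files, -m caps the number of map tasks,
# -bandwidth limits MB/s per map so the transfer does not saturate the uplink.
cmd = [
    "hadoop", "distcp",
    "-update",
    "-m", "50",
    "-bandwidth", "100",
    "-filters", FILTERS,
    SRC, DST,
]

result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
    raise RuntimeError(f"DistCp failed:\n{result.stderr}")
print("DistCp finished:", result.stdout.splitlines()[-1] if result.stdout else "ok")
```

Wrapping the call like this makes it easy to schedule repeated incremental runs and to plug the exit status into your reporting.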

2. Cloud works just like local storage


On-premises infrastructure and the cloud are not the same thing. A good example is ZooKeeper and Kafka. The ZK client library caches the resolved addresses of the ZK servers for its entire lifetime: this is a big problem for cloud deployments and requires workarounds such as static ENI network interfaces for the ZK servers.

It is also a good idea to run a series of non-functional tests (NFT) against the cloud infrastructure to make sure the settings and configuration can cope with your workloads.
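As one example of such a check, below is a minimal sketch of a producer throughput test using the kafka-python library (my assumption; the broker addresses, topic name, and message size are placeholders). Running the same script against the on-premises cluster gives you a baseline to compare against.

```python
import time
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker addresses and topic; replace with your cloud endpoints.
BOOTSTRAP = ["b-1.example.cloud:9092", "b-2.example.cloud:9092"]
TOPIC = "nft-producer-test"
MESSAGES = 100_000
PAYLOAD = b"x" * 1024  # 1 KiB message

producer = KafkaProducer(bootstrap_servers=BOOTSTRAP, acks="all", linger_ms=5)

start = time.time()
for _ in range(MESSAGES):
    producer.send(TOPIC, PAYLOAD)
producer.flush()  # wait until everything is actually delivered
elapsed = time.time() - start

print(f"{MESSAGES / elapsed:.0f} msg/s, "
      f"{MESSAGES * len(PAYLOAD) / elapsed / 1e6:.1f} MB/s")
```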

3. Object storage is a 100% replacement for HDFS


Separating storage and computing layers is a great idea, but there is a caveat.

With the exception of Google Cloud Storage, which provides strong consistency, most other object stores operate on an eventual consistency model. This means they can be used for ingesting raw and processed data and for writing out results, but not as temporary storage.
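A minimal PySpark sketch of that division of labour (the bucket and paths are hypothetical): raw input and the final output live in the object store, while intermediate data stays on the cluster's HDFS, where listing and consistency behaviour are predictable.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-vs-hdfs-layout").getOrCreate()

# Raw data is ingested from the object store.
raw = spark.read.json("s3a://my-datalake-bucket/raw/events/")

# Intermediate results go to HDFS on the cluster.
cleaned = raw.filter("event_type IS NOT NULL")
cleaned.write.mode("overwrite").parquet("hdfs:///tmp/events_cleaned/")

# The final, curated data set is written back to the object store.
final = spark.read.parquet("hdfs:///tmp/events_cleaned/")
final.groupBy("event_type").count() \
     .write.mode("overwrite").parquet("s3a://my-datalake-bucket/curated/event_counts/")
```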

4. You can deploy cloud infrastructure from the user interface


For a small test environment this may be enough, but the higher the infrastructure requirements, the more likely you are to end up writing code. You will probably want several environments (Dev, QA, Prod). This can be implemented with CloudFormation or Terraform, but simply copying other people's snippets will not get you there; you will have to rework a lot of it for your own needs.
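To give a flavour of the "write code" part, here is a minimal boto3 sketch that stamps out the same CloudFormation template once per environment. The template URL and parameter names are hypothetical; the point is that the environments differ only in parameters, not in copy-pasted templates.

```python
import boto3

# Hypothetical template location and parameter names.
TEMPLATE_URL = "https://s3.amazonaws.com/my-infra-bucket/emr-cluster.yaml"


def deploy(env: str, instance_count: int) -> str:
    """Create one stack per environment (Dev/QA/Prod) from the same template."""
    cfn = boto3.client("cloudformation")
    response = cfn.create_stack(
        StackName=f"data-platform-{env.lower()}",
        TemplateURL=TEMPLATE_URL,
        Parameters=[
            {"ParameterKey": "Environment", "ParameterValue": env},
            {"ParameterKey": "CoreInstanceCount", "ParameterValue": str(instance_count)},
        ],
        Capabilities=["CAPABILITY_NAMED_IAM"],
        Tags=[{"Key": "env", "Value": env}],
    )
    return response["StackId"]


if __name__ == "__main__":
    for env, count in [("Dev", 2), ("QA", 4), ("Prod", 20)]:
        print(deploy(env, count))
```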

5. For proper visibility in the cloud you just need to use ${SaaS_name}


Good visibility (logging and monitoring) into both the old and the new environment is critical for a successful migration.

This can be difficult because the two environments use different systems: for example, Prometheus and ELK on premises, and New Relic and Sumo Logic in the cloud. Even when a single SaaS solution is used in both environments, it is hard to scale.
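One way to soften this is to keep a thin, identical instrumentation layer in both environments and let each backend scrape or ingest it. A minimal sketch with the prometheus_client library (my assumption; the metric name, labels, and port are placeholders):

```python
import random
import time

# pip install prometheus_client
from prometheus_client import Gauge, start_http_server

# Placeholder metric: lag of an ingestion pipeline in seconds, labelled by
# environment so the same dashboard works on premises and in the cloud.
PIPELINE_LAG = Gauge(
    "pipeline_lag_seconds",
    "End-to-end lag of the ingestion pipeline",
    ["environment", "pipeline"],
)

if __name__ == "__main__":
    start_http_server(9108)  # scrape endpoint for whichever backend you run
    while True:
        # In a real exporter this value would come from Kafka/YARN/Spark, not random().
        PIPELINE_LAG.labels(environment="cloud", pipeline="events").set(random.uniform(0, 30))
        time.sleep(15)
```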

6. The cloud scales to infinity


Users are often overjoyed when they discover auto scaling and assume they can apply it to their data processing platforms right away. It really is easy to configure for EMR nodes without HDFS, but it takes extra knowledge for stateful services (for example, Kafka brokers). Before switching all traffic to the cloud infrastructure, check your current resource limits: the number of instances per class, disk volumes, and so on; you may also need to pre-warm the load balancers. Without this preparation, you will not be able to use the available capacity as intended.
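For the limits check, a small boto3 sketch like the one below can dump the current EC2 quotas for review before the cut-over (it assumes the Service Quotas API is available in your account and region; quota names vary):

```python
import boto3

# List current EC2 service quotas so you can compare them with the expected
# peak size of the cluster before switching traffic to the cloud.
quotas = boto3.client("service-quotas")

paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="ec2"):
    for q in page["Quotas"]:
        # Print only instance-related quotas to keep the output short.
        if "instance" in q["QuotaName"].lower():
            print(f'{q["QuotaName"]}: {q["Value"]}')
```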

7. I will just move my infrastructure unchanged


In fact, rather than focusing solely on the provider's own proprietary data stores, such as DynamoDB, keep API-compatible managed services in mind as well. For example, you can use the Amazon RDS cloud service to host the Hive Metastore database.

Another good example is EMR, the cloud-optimized big data platform. Simple at first glance, it still requires fine-tuning through post-install (bootstrap) scripts: you may need to customize heap sizes, third-party JARs, UDFs, and security add-ons. Also note that there is still no way to provide high availability (HA) for the master NameNode or YARN ResourceManager nodes.
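A minimal boto3 sketch of what that tuning looks like in practice: a bootstrap action for extra JARs/UDFs and a configuration classification for heap sizes. The bucket, script path, roles, release label, and instance sizes are hypothetical.

```python
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="data-platform-prod",
    ReleaseLabel="emr-5.29.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.2xlarge", "InstanceCount": 10},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # Post-install script: third-party JARs, UDFs, agents, security add-ons.
    BootstrapActions=[{
        "Name": "install-udfs-and-agents",
        "ScriptBootstrapAction": {"Path": "s3://my-infra-bucket/bootstrap/setup.sh"},
    }],
    # Heap sizes and other settings via configuration classifications.
    Configurations=[{
        "Classification": "spark-defaults",
        "Properties": {"spark.executor.memory": "8g", "spark.driver.memory": "4g"},
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```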

8. Transferring Hadoop/Spark jobs to the cloud is easy


Not really. To transfer jobs successfully, you need a clear picture of your business logic and pipelines, from the initial ingestion of raw data to the curated data sets. Everything gets even more complicated when the results of pipelines X and Y are the input of pipeline Z. All components of the flows and their dependencies should be laid out as explicitly as possible. This can be done with a DAG, as in the sketch below.
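One common way to express such dependencies is an Apache Airflow DAG (an assumption on my part; the job scripts and schedule are placeholders). A rough sketch for the X/Y/Z case above, assuming Airflow 2.x:

```python
from datetime import datetime

# pip install apache-airflow
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical pipelines: X and Y produce data sets that Z consumes.
with DAG(
    dag_id="x_y_feed_z",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    pipeline_x = BashOperator(task_id="pipeline_x",
                              bash_command="spark-submit x_job.py")
    pipeline_y = BashOperator(task_id="pipeline_y",
                              bash_command="spark-submit y_job.py")
    pipeline_z = BashOperator(task_id="pipeline_z",
                              bash_command="spark-submit z_job.py")

    # Z must wait for both X and Y to finish.
    [pipeline_x, pipeline_y] >> pipeline_z
```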

9. The cloud will reduce operating costs and the staff budget


Your own hardware requires capital expenditure and staff salaries. After moving to the cloud, those costs do not all disappear: you still have to respond to the needs of the business and hire people for development, support, troubleshooting, and budget planning. You will also need to invest in software and tooling for the new infrastructure.

You will need people on staff who understand how the new technologies work, which means highly qualified employees. So even with a smaller team, you may spend as much, if not more, on the salary of one good specialist.

10. No-Ops is close...


No-Ops is the dream of any business: a fully automated environment with no need for extra services and support. Is it possible?

A modest team of a few people is realistic only for small companies whose business is not directly tied to data. Everyone else will need at least one specialist who integrates and packages all the systems, keeps them consistent, automates, provides visibility, and fixes the bugs that come up along the way.



To summarize: moving data processing pipelines to the cloud is worth it. For the migration to go smoothly, plan the process carefully, taking into account the pitfalls described above. Think a few steps ahead and everything will work out.
