AWS EMR is a Big Data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, and Apache HBase. It is fully integrated with the AWS Big Data ecosystem, in particular with Amazon S3 for data storage. It is one of the most widely used AWS services for Big Data, thanks to its ease of use and its reliable, cluster-based operation.
A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances, and each instance in the cluster is called a node.
The node types in Amazon EMR are as follows:
• Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. Every cluster has a master node, and it's possible to create a single-node cluster with only the master node.
• Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.
• Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
Amazon EMR integrates with other AWS services to provide capabilities and functionality related to networking, storage, and security for your cluster.
With EMR, you can quickly run your workload on a cluster composed of multiple instance groups. For example, you can use On-Demand Instances in one group for guaranteed processing power together with Spot Instances in another group, so that jobs complete faster and at lower cost. Moreover, EMR clusters can be scaled at any time, so that your algorithms always run in a suitably sized environment.
Additionally, EMR lets you choose between two storage layers, HDFS and EMRFS. With HDFS, data is stored on the core nodes of your cluster and is not persisted permanently: it is lost when the cluster terminates. With EMRFS, you use S3 as the data layer for applications running on your cluster, so that you can separate compute from storage and persist data outside the lifecycle of your cluster.
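As an illustration, the mix of On-Demand and Spot instance groups can be sketched as the InstanceGroups list accepted by the boto3 EMR client's run_job_flow call. This is only a sketch: the group names, instance type, and node counts are assumptions chosen for the example, and the code builds the request data without calling AWS.

```python
def build_instance_groups(core_count=2, task_count=4, instance_type="m5.xlarge"):
    """Return an InstanceGroups list: On-Demand master and core, Spot task nodes.

    The instance type and counts are illustrative assumptions, not sizing advice.
    """
    return [
        # One master node coordinates the cluster.
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes run tasks and hold HDFS data, so keep them On-Demand.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
        # Task nodes only run tasks, so they are a natural fit for Spot.
        {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
         "InstanceType": instance_type, "InstanceCount": task_count},
    ]


groups = build_instance_groups()
# With boto3 installed and credentials configured, this list would be passed as
# Instances={"InstanceGroups": groups, ...} to boto3.client("emr").run_job_flow(...).
```

Keeping the master and core groups On-Demand means that a Spot interruption only removes extra compute, not the nodes that coordinate the cluster or store HDFS data.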
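In practice, the choice between the two layers shows up in the URI scheme that a job reads from or writes to. The helper below is a hypothetical sketch that classifies a path by its scheme; the bucket name and paths are made up for the example.

```python
from urllib.parse import urlparse


def storage_layer(uri):
    """Return (layer, persists_after_termination) for a data URI.

    Hypothetical helper: on EMR, hdfs:// paths live on the core nodes,
    while s3:// paths are served through EMRFS and outlive the cluster.
    """
    scheme = urlparse(uri).scheme
    if scheme == "hdfs":
        return ("HDFS", False)
    if scheme == "s3":
        return ("EMRFS", True)
    raise ValueError(f"unsupported scheme in {uri!r}")


# Example paths (assumptions): intermediate data on HDFS, final output on S3.
scratch = storage_layer("hdfs:///tmp/intermediate/")
output = storage_layer("s3://example-bucket/curated/")
```

A common pattern is to keep short-lived intermediate data on HDFS and write anything that must survive the cluster to S3.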
Amazon EMR monitors nodes in your cluster and automatically terminates and replaces an instance in case of failure.
Amazon EMR provides configuration options that control how your cluster is terminated—automatically or manually. If you configure your cluster to be automatically terminated, it is terminated after all the steps complete. This is referred to as a transient cluster. However, you can configure the cluster to continue running after processing completes so that you can choose to terminate it manually when you no longer need it. Or, you can create a cluster, interact with the installed applications directly, and then manually terminate the cluster when you no longer need it. The clusters in these examples are referred to as long-running clusters.
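When a cluster is launched through the API, the difference between a transient and a long-running cluster comes down to the KeepJobFlowAliveWhenNoSteps flag in the Instances argument of run_job_flow. The function name below is an assumption made for this sketch.

```python
def termination_settings(transient):
    """Return the Instances fragment that controls automatic termination.

    transient=True  -> the cluster terminates once all steps complete.
    transient=False -> the cluster keeps running until terminated manually.
    """
    return {"KeepJobFlowAliveWhenNoSteps": not transient}


transient_cluster = termination_settings(transient=True)
long_running_cluster = termination_settings(transient=False)
```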
AWS EMR is easy to deploy: you only need to configure the number and type of nodes, and the cluster is up and running in a few minutes. Application deployment is also straightforward and can be automated using CI/CD tools such as Jenkins.
AWS EMR integrates with CloudWatch to track performance metrics for the cluster and jobs within the cluster. You can configure alarms based on a variety of metrics such as whether the cluster is idle or the percentage of storage used.
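As a sketch of such an alarm, the function below builds the parameters for a CloudWatch alarm on the EMR IsIdle metric. The alarm name, the cluster ID, and the 30-minute threshold are assumptions for illustration; the resulting dictionary is meant to be passed to the boto3 CloudWatch client's put_metric_alarm call.

```python
def idle_alarm_params(cluster_id, idle_minutes=30):
    """Build put_metric_alarm parameters that fire when a cluster stays idle.

    Assumes the five-minute interval at which EMR reports metrics to CloudWatch.
    """
    period_seconds = 300
    return {
        "AlarmName": f"emr-{cluster_id}-idle",  # hypothetical naming scheme
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        # Minimum >= 1 means the cluster was idle for the whole period.
        "Statistic": "Minimum",
        "Period": period_seconds,
        "EvaluationPeriods": max(1, idle_minutes * 60 // period_seconds),
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


params = idle_alarm_params("j-EXAMPLE")  # placeholder cluster ID
# With boto3 and credentials: boto3.client("cloudwatch").put_metric_alarm(**params)
```

For the storage case, the same structure should work with the HDFSUtilization metric and a percentage threshold.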
Amazon EMR pricing depends on the instance type and number of EC2 instances that you deploy and the region in which you launch your cluster. On-demand pricing offers low rates, but you can reduce the cost even further by purchasing Reserved Instances or Spot Instances.
Amazon EMR integrates with IAM to manage permissions. You define permissions using IAM policies, which you attach to IAM users or IAM groups. The permissions that you define in the policy determine the actions that those users or members of the group can perform and the resources that they can access.
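For example, a read-only policy for EMR could look like the sketch below. The selection of actions and the wildcard resource are assumptions chosen for illustration; a real policy should be scoped to the clusters and actions that your users actually need.

```python
import json


def read_only_emr_policy():
    """Return an IAM policy document that allows listing and describing clusters."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:ListClusters",
                    "elasticmapreduce:DescribeCluster",
                    "elasticmapreduce:ListSteps",
                ],
                # Wildcard used only to keep the sketch short.
                "Resource": "*",
            }
        ],
    }


# JSON text that can be attached to an IAM user or group as a policy.
policy_json = json.dumps(read_only_emr_policy(), indent=2)
```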
AWS EMR can be governed from a centralized dashboard that lets customers manage clusters (create, delete, scale, configure, and so on), so that users always have a clear view of the cost and capacity of each cluster.
Moreover, by using EMR together with AWS Glue, it is possible to create a centralized Data Catalog that exposes the metadata associated with the data and tables used by EMR.
AWS EMR is fully integrated with Amazon CloudWatch. Thanks to this integration, we can collect logs and metrics related to EMR and use them to continuously monitor the pipelines.
Another advantage of AWS EMR is the possibility of using Spot Instances. Spot Instances are unused Amazon EC2 capacity offered at a discount; the price you pay is determined by supply and demand for Spot capacity, and you can optionally set a maximum price. The cost of using Spot Instances can be as much as 80% lower than using On-Demand Instances.
Not all workloads can be executed on Spot Instances, because Spot capacity can be reclaimed; in those cases we can use On-Demand instances, which can be shared among several small jobs or teams.
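On the cluster side, pointing Hive and Spark at the Glue Data Catalog is done through configuration classifications supplied at launch. The sketch below builds that Configurations list; the function name is an assumption, while the property key and factory class are the ones documented by AWS for this integration.

```python
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations():
    """Return EMR Configurations that make Hive and Spark use the Glue Data Catalog."""
    return [
        {
            "Classification": classification,
            "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY},
        }
        # hive-site covers Hive, spark-hive-site covers Spark SQL.
        for classification in ("hive-site", "spark-hive-site")
    ]


configurations = glue_catalog_configurations()
# Passed as Configurations=configurations to boto3.client("emr").run_job_flow(...).
```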
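To make the saving concrete, the sketch below compares the EC2 cost of the same job at an On-Demand rate and at an 80% Spot discount. The hourly rate, node count, and duration are made-up figures, and the separate EMR charge is ignored for simplicity.

```python
def ec2_cost(hourly_rate, nodes, hours, discount=0.0):
    """Return the EC2 cost in dollars for a job, with an optional Spot discount.

    discount is a fraction of the On-Demand rate (0.80 means 80% cheaper).
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return round(hourly_rate * nodes * hours * (1.0 - discount), 2)


# Hypothetical job: 10 nodes for 5 hours at $0.20 per node-hour.
on_demand_cost = ec2_cost(0.20, nodes=10, hours=5)
spot_cost = ec2_cost(0.20, nodes=10, hours=5, discount=0.80)
saving = round(on_demand_cost - spot_cost, 2)
```

Actual Spot prices vary by instance type, Region, and time, so the discount should be treated as an input to check rather than a constant.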
REQUIREMENTS & BUSINESS USE CASE
We start by understanding key business challenges and goals, in order to identify gaps and opportunities and to plan for the current and future state.
During the workshop phase, we perform a Technical & Opportunity assessment and plan technical deep-dive sessions, in order to identify the migration success criteria and the business and IT outcomes for the Data Lake.
The scope of the Pilot phase is to create a simple Pilot of the target solution, giving customers a concrete way to test it. We define the target architecture and component-level mapping according to the requirements collected in the previous phases, and execute incremental data migration and automation. After the UAT step, the Pilot is ready to Go Live!
Finally comes the phase in which we implement the final solution, split into Waves that guarantee a continuous release of the solution. We define the full migration strategy and schedule, and migrate the application code with a Dual Target approach. We can then start the implementation waves, including bulk import/export and the validation and audit of the solution. After the Test & UAT phases, we are ready for a successful GO LIVE!
Data Reply is the Reply group company offering a broad range of advanced analytics and AI-powered data services. We operate across different industries and business functions, working directly with executive-level professionals and Chief Officers and enabling them to achieve meaningful outcomes through effective use of data.
We have consolidated experience in designing and building cloud solutions: we support companies with both lift-and-shift solutions and cloud-native architectures. We help our customers develop and adopt holistic Big Data architectures and implement ML and AI models in a manner that is repeatable, efficient, scalable, simple and yet secure.
We support companies in combinatorial optimization processes with Quantum and Accelerated Computing techniques that enable a high-performance computational engine.