After you complete deployment of the Hadoop distribution, you can create big data clusters to process data. You can create multiple clusters in your Big Data Extensions environment, provided that the environment meets all prerequisites and has adequate resources.

Start the Big Data Extensions vApp.

Install the Big Data Extensions plug-in.

Connect to a Serengeti Management Server.

Configure one or more Hadoop distributions.

Understand the topology configuration options that you want to use with your cluster.

1. Use the vSphere Web Client to log in to vCenter Server.

2. Select Big Data Extensions > Big Data Clusters.

3. In the Objects tab, click New Big Data Cluster.

4. Follow the prompts to create the new cluster. The table describes the information to enter for the cluster that you want to create.

Option

Description

Hadoop cluster name

Type a name to identify the cluster.

Cluster names can contain only alphanumeric characters and underscores. When you choose the cluster name, also consider the applicable vApp name; together, the vApp and cluster names must be fewer than 80 characters.
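The naming rules above can be sketched as a quick pre-check. This is an illustration only; the exact validation that Big Data Extensions performs is not specified in this document.

```python
import re

# Rules from the documentation: cluster names may contain only
# alphanumeric characters and underscores, and the vApp name plus
# the cluster name must total fewer than 80 characters.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")

def is_valid_cluster_name(cluster_name: str, vapp_name: str) -> bool:
    """Return True if the cluster name satisfies both naming rules."""
    if not NAME_PATTERN.match(cluster_name):
        return False
    return len(vapp_name) + len(cluster_name) < 80
```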

Application manager

Select an application manager. The list contains the default application manager and the application managers that you added to your Big Data Extensions environment. For example, Cloudera Manager and Ambari.

Node template

Select a node template. The list contains all templates available in the Big Data Extensions vApp.

Hadoop distro

Select the Hadoop distribution. The list contains the default Apache Bigtop distribution for Big Data Extensions and the distributions that you added to your Big Data Extensions environment. The distribution names match the value of the --name parameter that was passed to the config-distro.rb script when the Hadoop distribution was configured. For example, cdh5 and mapr.

Note

To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD 1.1 or later cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce network traffic. If the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation process might fail, or the cluster might be created but not function.
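Before creating one of these clusters, you can spot-check forward and reverse resolution for a node's FQDN. A minimal sketch using Python's standard socket module; it only verifies that lookups succeed from the machine where you run it, not that the DNS records for every cluster node are complete.

```python
import socket

def check_fqdn(host: str) -> tuple[bool, bool]:
    """Return (forward_ok, reverse_ok) for the given host name."""
    try:
        ip = socket.gethostbyname(host)   # forward lookup: name -> IP
    except socket.gaierror:
        return (False, False)
    try:
        socket.gethostbyaddr(ip)          # reverse lookup: IP -> name
        return (True, True)
    except (socket.herror, socket.gaierror):
        return (True, False)
```

For example, `check_fqdn("namenode01.example.com")` (a hypothetical node name) should return `(True, True)` if your DNS server meets the requirement described above.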

Local repository URL

Type a local repository URL. This item is optional for all application managers. If you specify a local repository URL, the Cloudera Manager or Ambari application manager downloads the required RPM packages from the local repository that you specify instead of from a remote repository, which can affect system performance.

Deployment type

Select the type of cluster you want to create.

Basic Hadoop Cluster

Basic HBase Cluster

Compute Only Hadoop Cluster

Compute Workers Only Cluster

HBase Only Cluster

Data/Compute Separation Hadoop Cluster

Customized

The type of cluster you create determines the available node group selections.

If you select Customized, you can load an existing cluster specification file.

DataMaster Node Group

The DataMaster node is a virtual machine that runs the Hadoop NameNode service. This node manages the HDFS data in the cluster.

Select a resource template from the drop-down menu, or select Customize to customize a resource template.

For the master node, use shared storage so that you protect this virtual machine with vSphere HA and vSphere FT.

ComputeMaster Node Group

The ComputeMaster node is a virtual machine that runs the Hadoop JobTracker service. This node assigns tasks to Hadoop TaskTracker services deployed in the worker node group.

Select a resource template from the drop-down menu, or select Customize to customize a resource template.

For the master node, use shared storage so that you protect this virtual machine with vSphere HA and vSphere FT.

HBaseMaster Node Group (HBase cluster only)

The HBaseMaster node is a virtual machine that runs the HBase master service. This node orchestrates a cluster of one or more RegionServer slave nodes.

Select a resource template from the drop-down menu, or select Customize to customize a resource template.

For the master node, use shared storage so that you protect this virtual machine with vSphere HA and vSphere FT.

Worker Node Group

Worker nodes are virtual machines that run the Hadoop DataNode, TaskTracker, and HBase HRegionServer services. These nodes store HDFS data and execute tasks.

Select the number of nodes and the resource template from the drop-down menu, or select Customize to customize a resource template.

For worker nodes, use local storage.

Note

You can add nodes to the worker node group by using Scale Out Cluster. You cannot reduce the number of nodes.

Client Node Group

A client node is a virtual machine that contains Hadoop client components. From this virtual machine you can access HDFS, submit MapReduce jobs, run Pig scripts, run Hive queries, and run HBase commands.

Select the number of nodes and a resource template from the drop-down menu, or select Customize to customize a resource template.

Note

You can add nodes to the client node group by using Scale Out Cluster. You cannot reduce the number of nodes.

Hadoop Topology

Select the topology configuration that you want the cluster to use.

RACK_AS_RACK

HOST_AS_RACK

HVE

NONE

If you do not see the topology configuration that you want, define it in a topology rack-hosts mapping file, and use the Serengeti Command-Line Interface to upload the file to the Serengeti Management Server. See About Cluster Topology.
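A rack-hosts mapping file pairs a rack name with the ESXi hosts that belong to that rack, one rack per line. The rack and host names below are hypothetical; see About Cluster Topology for the exact format that your version expects.

```
rack1: esx-host-01.example.com, esx-host-02.example.com
rack2: esx-host-03.example.com, esx-host-04.example.com
```

After you upload the file from the Serengeti Command-Line Interface, the new topology is available for cluster creation.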

(Optional) If you want to select specific datastores to use with the cluster, select the Do you want to specify datastores to deploy? checkbox. By default, the cluster you create uses all available datastores.

Network

Select one or more networks for the cluster to use.

For optimal performance, use the same network for HDFS and MapReduce traffic in Hadoop and Hadoop+HBase clusters. HBase clusters use the HDFS network for traffic related to the HBase Master and HBase RegionServer services.

Important

You cannot configure multiple networks for clusters that use the MapR Hadoop distribution, or for clusters managed by Cloudera Manager or Ambari. Only the default Big Data Extensions application manager supports multiple networks.

To use one network for all traffic, select the network from the Network list.

To use separate networks for the management, HDFS, and MapReduce traffic, select Customize the HDFS network and MapReduce network, and select a network from each network list.

Select Datastores

(Optional) The ability to select specific datastores to use with the cluster is only available if you select the Do you want to specify datastores to deploy? checkbox in the Select topology and network pane.

Select the checkbox next to the datastores you want to use with the cluster. If you do not select any datastores, the cluster you create will use all available datastores.

Resource Pools

Select one or more resource pools that you want the cluster to use.

VM Password

Choose how initial administrator passwords are assigned to the virtual machine nodes of the cluster.

Use random password.

Set password.

To assign a custom initial administrator password to all the nodes in the cluster, choose Set password, and type and confirm the initial password.

Passwords must be from 8 to 20 characters, use only visible lower-ASCII characters (no spaces), and must contain at least one uppercase alphabetic character (A - Z), at least one lowercase alphabetic character (a - z), at least one digit (0 - 9), and at least one of the following special characters: _, @, #, $, %, ^, &, *
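The password policy above can be expressed as a small validation sketch. This assumes "visible lower-ASCII characters" means the printable ASCII range 0x21 through 0x7E, excluding the space character; the product's own check may differ.

```python
# Special characters permitted by the policy described above.
SPECIALS = set("_@#$%^&*")

def is_valid_node_password(pw: str) -> bool:
    """Check a candidate password against the documented rules."""
    if not (8 <= len(pw) <= 20):
        return False
    # Visible lower ASCII only: printable range, no spaces.
    if any(not (0x21 <= ord(c) <= 0x7E) for c in pw):
        return False
    return (any(c.isupper() for c in pw)
            and any(c.islower() for c in pw)
            and any(c.isdigit() for c in pw)
            and any(c in SPECIALS for c in pw))
```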

Important

If you set an initial administrator password, it is used for nodes that are created by future scaling and disk failure recovery operations. If you use the random password, nodes that are created by future scaling and disk failure recovery operations will use new, random passwords.

LDAP user

If LDAP/AD is enabled, you can specify an administrator group and a normal user group for each cluster. Big Data Extensions creates AD/LDAP connections on the node virtual machines so that users in these two groups can log in to the node virtual machines. Users in the administrator group have sudo privileges to perform administrative tasks on the node virtual machines.


The Serengeti Management Server clones the template virtual machine to create the nodes in the cluster. When each virtual machine starts, the agent on that virtual machine pulls the appropriate Big Data Extensions software components to that node and deploys the software.