Prerequisite Infrastructure

The following components are required before setting up the infrastructure needed by e6data. They are commonly present in most cloud environments, but if any are missing, follow the linked guides below to create them.

  1. Amazon Elastic Kubernetes Service (EKS) cluster

    1. Enable OpenID Connect

      • To provide secure connectivity between e6data clusters and data buckets within your AWS account.

    2. Set up Karpenter

      • To scale the infrastructure for e6data clusters.

    3. Set up AWS Load Balancer Controller

      1. For connectivity between e6data Console & e6data clusters.

      2. To allow connectivity between 3rd party tools & e6data clusters.

Create a VPC, Subnets & other VPC Resources

Optional: only required if a VPC in which to create the EKS cluster is not already present, or if you want to install e6data in a new, dedicated VPC.

  1. Open the Amazon VPC console at https://console.aws.amazon.com/vpc/

  2. On the VPC dashboard, choose Create VPC.

  3. For Resources to create, choose VPC and more.

  4. Keep Name tag auto-generation selected to create Name tags for the VPC resources, or clear it to provide your own Name tags for the VPC resources.

  5. For IPv4 CIDR block, enter an IPv4 address range for the VPC. A VPC must have an IPv4 address range.

  6. (Optional) To support IPv6 traffic, choose IPv6 CIDR block, Amazon-provided IPv6 CIDR block.

  7. Choose a Tenancy option. This option defines whether EC2 instances that you launch into the VPC run on hardware that's shared with other AWS accounts or on hardware that's dedicated for your use only. If you choose Default tenancy, EC2 instances launched into this VPC use the tenancy attribute specified when you launch the instance; for more information, see Launch an instance using defined parameters in the Amazon EC2 User Guide for Linux Instances. If you choose Dedicated tenancy, the instances always run as Dedicated Instances on hardware that's dedicated for your use. If you're using AWS Outposts, note that an Outpost requires private connectivity, so you must use Default tenancy.

  8. For Number of Availability Zones (AZs), we recommend that you provision subnets in at least two Availability Zones for a production environment. To choose the AZs for your subnets, expand Customize AZs. Otherwise, let AWS choose them for you.

  9. To configure your subnets, choose values for Number of public subnets and Number of private subnets. To choose the IP address ranges for your subnets, expand Customize subnets CIDR blocks. Otherwise, let AWS choose them for you.

  10. A NAT Gateway is required to export logs/metrics from the private subnet in which the e6data resources are deployed. Choose the number of AZs in which to create NAT gateways. In production, we recommend that you deploy a NAT gateway in each AZ with resources that need access to the public internet. Note that there is a cost associated with NAT gateways. For more information, see Pricing.

  11. (Optional) If you need to access Amazon S3 directly from your VPC, choose VPC endpoints, S3 Gateway. This creates a gateway VPC endpoint for Amazon S3. For more information, see Gateway VPC endpoints in the AWS PrivateLink Guide.

  12. (Optional) For DNS options, both options for domain name resolution are enabled by default. If the default doesn't meet your needs, you can disable these options.

  13. (Optional) To add a tag to your VPC, expand Additional tags, choose Add new tag, and enter a tag key and a tag value.

    • When using a separate VPC for e6data, adding a tag such as app=e6data is recommended to help monitor usage & costs.

  14. In the Preview pane, you can visualize the relationships between the VPC resources that you've configured. Solid lines represent relationships between resources. Dotted lines represent network traffic to NAT gateways, internet gateways, and gateway endpoints. After you create the VPC, you can visualize the resources in your VPC in this format at any time using the Resource map tab. For more information, see Visualize the resources in your VPC.

  15. When you are finished configuring your VPC, choose Create VPC.

Please make note of the VPC Region; it will be required when creating the Workspace in the e6data Console.
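
If you prefer to script this step rather than use the console, the same resources can be created with the AWS CLI. The sketch below is illustrative only: the CIDR blocks, Availability Zones, and <VPC_ID> are placeholders, and the remaining subnets, route tables, internet gateway, and NAT gateway from the walkthrough above must still be created explicitly (the "VPC and more" console flow creates all of them in one step).

# Create the VPC and tag it for cost tracking
aws ec2 create-vpc \
  --cidr-block 10.0.0.0/16 \
  --tag-specifications 'ResourceType=vpc,Tags=[{Key=app,Value=e6data}]'

# Create one public and one private subnet (repeat per AZ)
aws ec2 create-subnet \
  --vpc-id <VPC_ID> \
  --cidr-block 10.0.0.0/20 \
  --availability-zone us-west-2a

aws ec2 create-subnet \
  --vpc-id <VPC_ID> \
  --cidr-block 10.0.16.0/20 \
  --availability-zone us-west-2b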

Create EKS Cluster & Default Node Group

Optional: only required if an EKS cluster is not already present, or if you want to install e6data in a new, dedicated EKS cluster.

Create EKS Cluster

To get started with setting up an Amazon Elastic Kubernetes Service (EKS) cluster, please follow the comprehensive documentation provided by AWS: Creating an Amazon EKS cluster.
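
If you use eksctl, cluster creation can also be scripted. A minimal sketch, assuming eksctl is installed and authenticated; the name, region, and Kubernetes version are placeholders:

eksctl create cluster \
  --name <EKS_CLUSTER_NAME> \
  --region <AWS_REGION> \
  --version <K8S_VERSION> \
  --without-nodegroup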

Please make note of the EKS Cluster Name; it will be required when creating the Workspace in the e6data Console.

Enable OpenID Connect (OIDC)

Enabling IAM roles for service accounts and creating an OIDC (OpenID Connect) provider for your EKS cluster is crucial: it is what allows e6data clusters to securely access the data buckets within your AWS account.

e6data uses OIDC because it provides least-privilege access, credential isolation, and auditability.

To enable IAM roles for service accounts and create an OIDC (OpenID Connect) provider for your EKS cluster, please refer to the documentation Creating an IAM OIDC provider for your cluster - Amazon EKS.
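
For reference, eksctl can perform this step with a single command; the cluster name below is a placeholder:

eksctl utils associate-iam-oidc-provider \
  --cluster <EKS_CLUSTER_NAME> \
  --approve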

Set up Karpenter

To set up Karpenter for your Amazon Elastic Kubernetes Service (EKS) cluster, refer to the official Karpenter documentation.

Use a restricted IAM policy: instead of the broad permissions outlined in the Karpenter documentation for the Karpenter controller role, you can use a more restricted IAM policy that limits permissions to e6data-managed resources, as sketched below.
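
The snippet below is a sketch of that idea only, not the exact policy published by e6data: it scopes destructive EC2 actions to resources tagged app=e6data and assumes <KARPENTER_NODE_ROLE_ARN> is substituted with your node role's ARN. Validate the action list against the Karpenter version you run.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExampleKarpenterDiscoveryAndLaunch",
      "Effect": "Allow",
      "Action": [
        "ec2:RunInstances",
        "ec2:CreateFleet",
        "ec2:CreateLaunchTemplate",
        "ec2:CreateTags",
        "ec2:Describe*",
        "ssm:GetParameter",
        "pricing:GetProducts"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ExampleKarpenterScopedDeletion",
      "Effect": "Allow",
      "Action": [
        "ec2:TerminateInstances",
        "ec2:DeleteLaunchTemplate"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/app": "e6data" }
      }
    },
    {
      "Sid": "ExamplePassNodeRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "<KARPENTER_NODE_ROLE_ARN>"
    }
  ]
}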

Karpenter is configured through two main resources:

EC2 NodeClass

Karpenter NodeClasses serve as provider-specific blueprints for your worker nodes; on AWS they are defined as EC2NodeClass resources. They specify crucial details including the AMI family (OS), security groups, subnets, and IAM roles.

A. Create an e6data EC2 Node Class
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: <NODECLASS_NAME>
  labels:
    app: e6data
    e6data-workspace-name: <WORKSPACE_NAME>
spec:
  amiFamily: AL2
  role: "<KARPENTER_NODE_ROLE_NAME>"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: <EKS_CLUSTER_NAME>
  securityGroupSelectorTerms:
    - tags:
        aws:eks:cluster-name: <EKS_CLUSTER_NAME>
  tags: <TAGS>
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: <VOLUME_SIZE>Gi
        volumeType: gp3
  userData: |
    #!/bin/bash
    # Install the NVMe CLI used to detect local instance-store drives
    yum install nvme-cli -y
    # Check if NVMe drives are present
    if nvme list | grep -q "Amazon EC2 NVMe Instance Storage"; then
        # NVMe drives are detected, proceed with NVMe-specific commands
        nvme_drives=$(nvme list | grep "Amazon EC2 NVMe Instance Storage" | cut -d " " -f 1 || true)
        readarray -t nvme_drives <<< "$nvme_drives"
        num_drives=${#nvme_drives[@]}
        if [ $num_drives -gt 1 ]; then
            # Multiple NVMe drives detected, create RAID array
            yum install mdadm -y
            mdadm --create /dev/md0 --level=0 --name=md0 --raid-devices=$num_drives "${nvme_drives[@]}"
            mkfs.ext4 /dev/md0
            mount_location="/tmp"
            mount /dev/md0 $mount_location
            mdadm --detail --scan >> /etc/mdadm.conf 
            echo /dev/md0 $mount_location ext4 defaults,noatime 0 2 >> /etc/fstab
        else
            # Single NVMe drive detected, format and mount it
            for disk in "${nvme_drives[@]}"
            do
                mkfs.ext4 -F $disk
                mount_location="/tmp"
                mount $disk $mount_location
                echo $disk $mount_location ext4 defaults,noatime 0 2 >> /etc/fstab 
            done
        fi
    else
        # NVMe drives are not detected, exit gracefully or skip NVMe-specific commands
        echo "No NVMe drives detected. Skipping NVMe-specific commands."
    fi
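
Once the placeholders are filled in, apply the manifest with kubectl; the file name here is illustrative:

kubectl apply -f e6data-ec2nodeclass.yaml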

NodePool

A single Karpenter NodePool can handle many different pod shapes, eliminating the need for multiple node groups. Use the manifest below to create a NodePool that provisions nodes from the EC2NodeClass defined above (which carries the securityGroupSelectorTerms and subnetSelectorTerms used for resource discovery). Setting the consolidation policy to WhenEmpty reduces costs by removing empty nodes.

B. Create an e6data Node Pool
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: <NODEPOOL_NAME>
  labels:
    app: e6data
    e6data-workspace-name: <WORKSPACE_NAME>
spec:
  template:
    metadata:
      labels:
        app: e6data
        e6data-workspace-name: <WORKSPACE_NAME>
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["m5","m5d","c5","c5d","c4","r4",.....]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a", "us-west-2b",...]
      nodeClassRef:
        name: <NODECLASS_NAME>
      taints:
        - key: "e6data-workspace-name"
          value: <WORKSPACE_NAME>
          effect: NoSchedule  
  limits:
    cpu: 100000
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
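
As with the NodeClass, apply the manifest (file name illustrative) and verify that both resources were registered:

kubectl apply -f e6data-nodepool.yaml
kubectl get nodepools.karpenter.sh,ec2nodeclasses.karpenter.k8s.aws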

Set up AWS Load Balancer Controller (ALB)

An AWS Application Load Balancer (ALB) is required in the EKS cluster for connectivity between the e6data Console and e6data clusters, as well as between querying/BI tools and the e6data Query Engine. The AWS Load Balancer Controller provisions and manages these load balancers from within the cluster.

To install the AWS Load Balancer Controller (ALB) in your Amazon Elastic Kubernetes Service (EKS) cluster, follow the steps outlined in the official AWS documentation at Installing the AWS Load Balancer Controller add-on - Amazon EKS.

Please take note of the following:

  1. Ensure that the AWS Load Balancer Controller is version v2.5 or higher.

  2. Configure the ALB controller with the parameters listed below:

enableShield = false
enableWafv2 = false
enableWaf = false
defaultTags.app = "e6data"
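
These parameters map to the controller's Helm chart values. A sketch of what the install command might look like, assuming the eks/aws-load-balancer-controller chart and a pre-created service account as described in the AWS documentation:

helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  --namespace kube-system \
  --set clusterName=<EKS_CLUSTER_NAME> \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller \
  --set enableShield=false \
  --set enableWafv2=false \
  --set enableWaf=false \
  --set defaultTags.app=e6data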

Use a restricted IAM policy: instead of using the broad permissions mentioned in the ALB Controller documentation, you can use a more restricted IAM policy that limits permissions to e6data-managed resources, as sketched below.
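
The same tag-scoping pattern shown for Karpenter applies here. The fragment below is illustrative only and is not the full controller policy; it restricts delete actions to load balancer resources tagged app=e6data:

{
  "Effect": "Allow",
  "Action": [
    "elasticloadbalancing:DeleteLoadBalancer",
    "elasticloadbalancing:DeleteTargetGroup",
    "elasticloadbalancing:DeleteListener"
  ],
  "Resource": "*",
  "Condition": {
    "StringEquals": { "aws:ResourceTag/app": "e6data" }
  }
}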
