Prerequisite Infrastructure

The following components are required before setting up the infrastructure needed by e6data. They are commonly present in most cloud environments, but if any are missing, please follow the linked guides below to create them. A quick way to check which components already exist is sketched after the list.

  1. Amazon Elastic Kubernetes Service (EKS) cluster

    1. Enable OpenID Connect

      • To provide secure connectivity between e6data clusters and data buckets within your AWS account.

    2. Set up Karpenter

      • To scale the infrastructure for e6data clusters.

    3. Set up AWS Load Balancer Controller

      • For connectivity between the e6data Console & e6data clusters.

      • To allow connectivity between 3rd-party tools & e6data clusters.
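
If you are unsure which of these components already exist, a quick check along the following lines can help (a minimal sketch; the cluster name, region, and namespaces are assumptions that may differ in your environment):

# Check whether the EKS cluster has an OIDC issuer (required for IRSA)
aws eks describe-cluster --name <EKS_CLUSTER_NAME> --region <REGION> \
  --query "cluster.identity.oidc.issuer" --output text

# Check for a running Karpenter controller (commonly in the "karpenter" namespace)
kubectl get pods -n karpenter

# Check for the AWS Load Balancer Controller (commonly in "kube-system")
kubectl get deployment -n kube-system aws-load-balancer-controller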

Create a VPC, Subnets & Other VPC Resources

Optional: only required if a VPC is not already present for the EKS cluster, or if you want to install e6data in a new VPC. The console steps are listed below; an equivalent AWS CLI sketch follows the list.

  1. Open the Amazon VPC console at https://console.aws.amazon.com/vpc/

  2. On the VPC dashboard, choose Create VPC.

  3. For Resources to create, choose VPC and more.

  4. Keep Name tag auto-generation selected to create Name tags for the VPC resources automatically, or clear it to provide your own Name tags.

  5. For IPv4 CIDR block, enter an IPv4 address range for the VPC. A VPC must have an IPv4 address range.

  6. (Optional) To support IPv6 traffic, choose IPv6 CIDR block, Amazon-provided IPv6 CIDR block.

  7. Choose a Tenancy option to determine whether EC2 instances in the VPC run on shared hardware or on hardware dedicated to your use.

    • Default tenancy allows instances to use the tenancy specified at launch, while Dedicated tenancy ensures all instances run on dedicated hardware. Learn more about Dedicated Instances.

    • AWS Outposts require Default tenancy due to private connectivity. For more details, refer to Launch an instance using defined parameters in the Amazon EC2 User Guide for Linux Instances.

  8. For Number of Availability Zones (AZs), it's recommended to provision subnets in at least two AZs for production. To select specific AZs, expand Customize AZs; otherwise, AWS will choose them automatically.

  9. To configure your subnets, choose values for Number of public subnets and Number of private subnets. To choose the IP address ranges for your subnets, expand Customize subnets CIDR blocks. Otherwise, let AWS choose them for you.

    • A NAT Gateway is required to export logs and metrics from the private subnet where e6data resources are deployed.

    • Choose the number of Availability Zones (AZs) in which to create NAT gateways.

    • For production environments, it's recommended to deploy a NAT Gateway in each AZ that contains resources needing access to the public internet.

    • ⚠️ Note: NAT Gateways incur additional cost. For more information, see the NAT Gateway Pricing page.

  10. (Optional) If you need to access Amazon S3 directly from your VPC, choose VPC endpoints, S3 Gateway. This creates a gateway VPC endpoint for Amazon S3. For more information, see Gateway VPC endpoints in the AWS PrivateLink Guide.

  11. (Optional) For DNS options, both options for domain name resolution are enabled by default. If the default doesn't meet your needs, you can disable these options.

  12. (Optional) To add a tag to your VPC, expand Additional tags, choose Add new tag, and enter a tag key and a tag value.

    • When using a separate VPC for e6data, adding a tag such as app=e6data is recommended to help monitor usage & costs.

    • In the Preview pane, you can visualize relationships between the VPC resources you've configured.

    • Solid lines show resource relationships; dotted lines indicate network traffic paths to NAT gateways, internet gateways, and gateway endpoints.

    • After creating the VPC, you can view this diagram anytime via the Resource map tab.

  13. When you are finished configuring your VPC, choose Create VPC.
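
If you prefer the AWS CLI over the console, the steps above map roughly onto the following calls (a minimal sketch: the CIDR ranges, region, AZs, and resource IDs are placeholders to substitute with your own values, and route table associations are omitted for brevity):

# 1. Create the VPC and tag it for cost tracking
aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
  --tag-specifications 'ResourceType=vpc,Tags=[{Key=app,Value=e6data}]'

# 2. Create one public and one private subnet per AZ (repeat for each AZ)
aws ec2 create-subnet --vpc-id <VPC_ID> --cidr-block 10.0.0.0/20 --availability-zone <AZ>
aws ec2 create-subnet --vpc-id <VPC_ID> --cidr-block 10.0.16.0/20 --availability-zone <AZ>

# 3. Create and attach an internet gateway for the public subnets
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --internet-gateway-id <IGW_ID> --vpc-id <VPC_ID>

# 4. Allocate an Elastic IP and create a NAT gateway in a public subnet
aws ec2 allocate-address --domain vpc
aws ec2 create-nat-gateway --subnet-id <PUBLIC_SUBNET_ID> --allocation-id <ALLOCATION_ID>

# 5. (Optional) Create the S3 gateway endpoint
aws ec2 create-vpc-endpoint --vpc-id <VPC_ID> --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.<REGION>.s3 --route-table-ids <ROUTE_TABLE_ID>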

Create EKS Cluster & Default Node Group

Optional: only required if an EKS cluster is not already present, or if you want to install e6data in a new EKS cluster.

Create EKS Cluster

To set up an Amazon Elastic Kubernetes Service (EKS) cluster, please follow the comprehensive documentation provided by AWS: Creating an Amazon EKS cluster.
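
For a scripted alternative, eksctl can create a comparable cluster in one command (a minimal sketch; the cluster name, region, Kubernetes version, and subnet IDs are assumptions to adjust for your environment):

# Create an EKS cluster in existing subnets, with an OIDC provider enabled
eksctl create cluster \
  --name <EKS_CLUSTER_NAME> \
  --region <REGION> \
  --version <K8S_VERSION> \
  --vpc-private-subnets <PRIVATE_SUBNET_IDS> \
  --vpc-public-subnets <PUBLIC_SUBNET_IDS> \
  --with-oidc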

Enable OpenID Connect (OIDC)

Enabling IAM roles for service accounts (IRSA) and creating an OIDC (OpenID Connect) provider for your EKS cluster is crucial in this context: it is what allows e6data clusters to securely access the data buckets within your AWS account.

e6data uses OIDC because it provides least-privilege access, credential isolation & auditability.

To enable IAM roles for service accounts and create an OIDC (OpenID Connect) provider for your EKS cluster, please refer to the documentation Creating an IAM OIDC provider for your cluster - Amazon EKS.
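
If you use eksctl, the association is a single command (the cluster name and region below are placeholders):

# Create and associate an IAM OIDC provider for the cluster
eksctl utils associate-iam-oidc-provider \
  --cluster <EKS_CLUSTER_NAME> \
  --region <REGION> \
  --approve

# Verify: the cluster's issuer URL should now appear among the IAM providers
aws iam list-open-id-connect-providers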

Set up Karpenter

To set up Karpenter for your Amazon Elastic Kubernetes Service (EKS) cluster, refer to the official Karpenter documentation.
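
For reference, a Helm-based install of the Karpenter controller typically looks like the following (a sketch based on the Karpenter documentation; the chart version, namespace, and controller role ARN are placeholders, and newer Karpenter releases may use different settings):

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version <KARPENTER_VERSION> \
  --namespace karpenter --create-namespace \
  --set settings.clusterName=<EKS_CLUSTER_NAME> \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=<KARPENTER_CONTROLLER_ROLE_ARN> \
  --wait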

Use a Restricted IAM Policy: Instead of the broad permissions outlined in the Karpenter documentation for the Karpenter controller's IAM role, you can use a more restricted IAM policy that limits permissions to only e6data-managed resources.
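
A fragment of such a policy might look like this (an illustrative sketch, not a complete Karpenter controller policy: it scopes instance termination to nodes tagged app=e6data while leaving read-only discovery actions unscoped):

# Create a restricted policy; the policy name is a placeholder
aws iam create-policy --policy-name <POLICY_NAME> --policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowScopedInstanceTermination",
      "Effect": "Allow",
      "Action": "ec2:TerminateInstances",
      "Resource": "*",
      "Condition": { "StringEquals": { "aws:ResourceTag/app": "e6data" } }
    },
    {
      "Sid": "AllowReadOnlyDiscovery",
      "Effect": "Allow",
      "Action": ["ec2:Describe*", "ssm:GetParameter", "pricing:GetProducts"],
      "Resource": "*"
    }
  ]
}'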

Karpenter is configured through two main resources:

EC2 NodeClass

Karpenter NodeClasses serve as provider-specific blueprints for your worker nodes; on AWS these are EC2NodeClasses. They define crucial details including the AMI family (OS), security groups, subnets, and IAM roles.

A. Create an e6data EC2 Node Class
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: <NODECLASS_NAME>
  labels:
    app: e6data
    e6data-workspace-name: <WORKSPACE_NAME>
spec:
  amiFamily: AL2
  role: "<KARPENTER_NODE_ROLE_NAME>"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: <EKS_CLUSTER_NAME>
  securityGroupSelectorTerms:
    - tags:
        aws:eks:cluster-name: <EKS_CLUSTER_NAME>
  tags: <TAGS>
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: <VOLUME_SIZE>Gi
        volumeType: gp3
  userData: |
    #!/bin/bash
    yum install nvme-cli -y
    # Check if NVMe instance-store drives are present
    if nvme list | grep -q "Amazon EC2 NVMe Instance Storage"; then
        # NVMe drives detected; collect their device paths
        nvme_drives=$(nvme list | grep "Amazon EC2 NVMe Instance Storage" | cut -d " " -f 1 || true)
        readarray -t nvme_drives <<< "$nvme_drives"
        num_drives=${#nvme_drives[@]}
        if [ "$num_drives" -gt 1 ]; then
            # Multiple NVMe drives detected: stripe them into a RAID 0 array
            yum install mdadm -y
            mdadm --create /dev/md0 --level=0 --name=md0 --raid-devices="$num_drives" "${nvme_drives[@]}"
            mkfs.ext4 /dev/md0
            mount_location="/tmp"
            mount /dev/md0 "$mount_location"
            mdadm --detail --scan >> /etc/mdadm.conf
            echo "/dev/md0 $mount_location ext4 defaults,noatime 0 2" >> /etc/fstab
        else
            # Single NVMe drive detected: format and mount it directly
            for disk in "${nvme_drives[@]}"
            do
                mkfs.ext4 -F "$disk"
                mount_location="/tmp"
                mount "$disk" "$mount_location"
                echo "$disk $mount_location ext4 defaults,noatime 0 2" >> /etc/fstab
            done
        fi
    else
        # No NVMe drives detected; skip instance-store setup
        echo "No NVMe drives detected. Skipping NVMe-specific commands."
    fi

NodePool

A single Karpenter NodePool can handle many different pods, eliminating the need for multiple node groups. Use the manifest below to create a NodePool that references the EC2NodeClass created above, which discovers its subnets and security groups via subnetSelectorTerms and securityGroupSelectorTerms. Setting the consolidation policy to WhenEmpty reduces costs by removing empty nodes.

B. Create an e6data Node Pool
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: <NODEPOOL_NAME>
  labels:
    app: e6data
    e6data-workspace-name: <WORKSPACE_NAME>
spec:
  template:
    metadata:
      labels:
        app: e6data
        e6data-workspace-name: <WORKSPACE_NAME>
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["m5","m5d","c5","c5d","c4","r4",.....]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a", "us-west-2b",...]
      nodeClassRef:
        name: <NODECLASS_NAME>
      taints:
        - key: "e6data-workspace-name"
          value: <WORKSPACE_NAME>
          effect: NoSchedule  
  limits:
    cpu: 100000
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
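
Once the placeholders in both manifests are filled in, apply them like any other Kubernetes resources (the file names below are assumptions):

kubectl apply -f e6data-ec2nodeclass.yaml
kubectl apply -f e6data-nodepool.yaml

# Confirm that Karpenter has registered both resources
kubectl get ec2nodeclasses,nodepools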

Set up AWS Load Balancer Controller (ALB)

An AWS Application Load Balancer (ALB) is required in the EKS cluster for connectivity between the e6data Console and e6data clusters, as well as between querying/BI tools and the e6data Query Engine.

To install the AWS Load Balancer Controller in your Amazon Elastic Kubernetes Service (EKS) cluster, follow the steps outlined in the official AWS documentation: Installing the AWS Load Balancer Controller add-on - Amazon EKS.

Please take note of the following:

  1. Ensure that the AWS Load Balancer Controller is version v2.5 or higher.

  2. Configure the ALB controller with the parameters listed below:

enableShield = false
enableWafv2 = false
enableWaf = false
defaultTags.app = "e6data"
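
When installing via the controller's Helm chart, these parameters map directly onto chart values (a sketch; the release name, namespace, and service-account settings follow the AWS documentation linked above and may differ in your setup):

helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller \
  --namespace kube-system \
  --set clusterName=<EKS_CLUSTER_NAME> \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller \
  --set enableShield=false \
  --set enableWaf=false \
  --set enableWafv2=false \
  --set defaultTags.app=e6data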

Use a Restricted IAM Policy: Instead of using the broad permissions mentioned in the ALB Controller documentation, you can use a more restricted IAM policy. Here is an example policy that you can use to limit permissions to only e6data-managed resources.
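
As with Karpenter, a restricted fragment might condition mutating actions on the app=e6data tag applied via defaultTags (an illustrative sketch, not the complete controller policy):

# Create a restricted policy; the policy name is a placeholder
aws iam create-policy --policy-name <POLICY_NAME> --policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowScopedLoadBalancerMutations",
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:ModifyLoadBalancerAttributes",
        "elasticloadbalancing:DeleteLoadBalancer"
      ],
      "Resource": "*",
      "Condition": { "StringEquals": { "aws:ResourceTag/app": "e6data" } }
    }
  ]
}'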
