
Prerequisite Infrastructure

Last updated 8 months ago

The following components are required before setting up the infrastructure needed by e6data. Most cloud environments already have them; if any are missing, follow the corresponding sections below to create them.

  • Virtual Private Cloud (VPC)

  • Amazon Elastic Kubernetes Service (EKS) cluster

  • Enable OpenID Connect

      • To provide secure connectivity between e6data clusters and data buckets within your AWS account.

  • Set up Karpenter

      • To scale the infrastructure for e6data clusters.

  • Set up AWS Load Balancer Controller

      1. For connectivity between the e6data Console & e6data clusters.

      2. To allow connectivity between 3rd party tools & e6data clusters.

Create a VPC, Subnets & other VPC Resources

Optional: only required if no VPC exists in which to create an EKS Cluster, or if e6data will be installed in a new VPC.

  1. Open the Amazon VPC console at https://console.aws.amazon.com/vpc/.

  2. On the VPC dashboard, choose Create VPC.

  3. For Resources to create, choose VPC and more.

  4. Keep Name tag auto-generation selected to create Name tags for the VPC resources, or clear it to provide your own Name tags for the VPC resources.

  5. For IPv4 CIDR block, enter an IPv4 address range for the VPC. A VPC must have an IPv4 address range.

  6. (Optional) To support IPv6 traffic, choose IPv6 CIDR block, Amazon-provided IPv6 CIDR block.

  7. Choose a Tenancy option. This option defines whether EC2 instances that you launch into the VPC run on hardware that's shared with other AWS accounts or on hardware that's dedicated for your use only. If you choose Default tenancy for the VPC, EC2 instances launched into it use the tenancy attribute specified when you launch the instance. For more information, see Launch an instance using defined parameters in the Amazon EC2 User Guide for Linux Instances. If you choose Dedicated tenancy, the instances always run as Dedicated Instances on hardware that's dedicated for your use. If you're using AWS Outposts, your Outpost requires private connectivity; you must use Default tenancy.

  8. For Number of Availability Zones (AZs), we recommend that you provision subnets in at least two Availability Zones for a production environment. To choose the AZs for your subnets, expand Customize AZs. Otherwise, let AWS choose them for you.

  9. To configure your subnets, choose values for Number of public subnets and Number of private subnets. To choose the IP address ranges for your subnets, expand Customize subnets CIDR blocks. Otherwise, let AWS choose them for you.

  10. A NAT Gateway is required to export logs/metrics from the private subnet in which the e6data resources are deployed. Choose the number of AZs in which to create NAT gateways. In production, we recommend that you deploy a NAT gateway in each AZ with resources that need access to the public internet. Note that there is a cost associated with NAT gateways. For more information, see Pricing.

  11. (Optional) If you need to access Amazon S3 directly from your VPC, choose VPC endpoints, S3 Gateway. This creates a gateway VPC endpoint for Amazon S3. For more information, see Gateway VPC endpoints in the AWS PrivateLink Guide.

  12. (Optional) For DNS options, both options for domain name resolution are enabled by default. If the default doesn't meet your needs, you can disable these options.

  13. (Optional) To add a tag to your VPC, expand Additional tags, choose Add new tag, and enter a tag key and a tag value.

    • When using a separate VPC for e6data, we recommend adding a tag (e.g., app=e6data) to help monitor usage & costs.

  14. In the Preview pane, you can visualize the relationships between the VPC resources that you've configured. Solid lines represent relationships between resources. Dotted lines represent network traffic to NAT gateways, internet gateways, and gateway endpoints. After you create the VPC, you can visualize the resources in your VPC in this format at any time using the Resource map tab. For more information, see Visualize the resources in your VPC.

  15. When you are finished configuring your VPC, choose Create VPC.

Please make a note of the VPC Region; it will be required when creating the Workspace in the e6data Console.
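For teams that prefer infrastructure as code over the console, the core VPC settings from the steps above could be sketched as a CloudFormation template. This is only an illustration, not part of the e6data setup: the logical name `E6dataVpc` and the CIDR block `10.0.0.0/16` are assumptions, and the subnets, NAT gateways, and S3 gateway endpoint that the console's "VPC and more" option creates are not included.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal VPC sketch for an e6data deployment (illustrative values only)
Resources:
  E6dataVpc:                      # hypothetical logical name
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16      # assumed range; choose one that fits your network plan (step 5)
      InstanceTenancy: default    # step 7: Default tenancy
      EnableDnsSupport: true      # step 12: DNS options, enabled by default in the console
      EnableDnsHostnames: true
      Tags:
        - Key: app                # step 13: tag recommended for usage & cost monitoring
          Value: e6data
```

Subnets in at least two Availability Zones and a NAT gateway would still have to be added before this VPC could host an EKS cluster for e6data.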

Create EKS Cluster & Default Node Group

Optional, only required if an EKS Cluster is not present or to install e6data in a new EKS Cluster.

Create EKS Cluster

Please make a note of the EKS Cluster Name; it will be required when creating the Workspace in the e6data Console.

Enable OpenID Connect (OIDC)

Enabling IAM roles for service accounts and creating an OpenID Connect (OIDC) provider for your EKS cluster gives e6data clusters secure access to the data buckets within your AWS account.

e6data uses OIDC for more secure access as it provides least privilege, credential isolation & auditability.
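One way to enable the OIDC provider is through an eksctl cluster configuration file. The sketch below assumes eksctl is used to manage the cluster, which this guide does not require; `<AWS_REGION>` is a placeholder introduced here for illustration.

```yaml
# eksctl ClusterConfig sketch: creates the IAM OIDC provider together with the cluster
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: <EKS_CLUSTER_NAME>
  region: <AWS_REGION>      # placeholder; use the VPC Region noted earlier
iam:
  withOIDC: true            # enables IAM roles for service accounts (IRSA)
```

For a cluster that already exists, `eksctl utils associate-iam-oidc-provider --cluster <EKS_CLUSTER_NAME> --approve` achieves the same result without recreating the cluster.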

Set up Karpenter

Karpenter has two main components, the EC2 NodeClass and the NodePool, both described below.

EC2 NodeClass

Karpenter NodeClasses are cloud-provider-specific blueprints for worker nodes; on AWS they are EC2NodeClasses. They define crucial details including the AMI family (OS), security groups, subnets, and IAM roles.

A. Create an e6data EC2 Node Class
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: <NODECLASS_NAME>
  labels:
    app: e6data
    e6data-workspace-name: <WORKSPACE_NAME>
spec:
  amiFamily: AL2
  role: "<KARPENTER_NODE_ROLE_NAME>"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: <EKS_CLUSTER_NAME>
  securityGroupSelectorTerms:
    - tags:
        aws:eks:cluster-name: <EKS_CLUSTER_NAME>
  tags: <TAGS>
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: <VOLUME_SIZE>Gi
        volumeType: gp3
  userData: |
    #!/bin/bash
    # Note: if this manifest is rendered through a Terraform template, escape ${...} as $${...}.
    yum install nvme-cli -y
    # Check if NVMe drives are present
    if nvme list | grep -q "Amazon EC2 NVMe Instance Storage"; then
        # NVMe drives are detected, proceed with NVMe-specific commands
        nvme_drives=$(nvme list | grep "Amazon EC2 NVMe Instance Storage" | cut -d " " -f 1 || true)
        readarray -t nvme_drives <<< "$nvme_drives"
        num_drives=${#nvme_drives[@]}
        mount_location="/tmp"
        if [ "$num_drives" -gt 1 ]; then
            # Multiple NVMe drives detected, combine them into a RAID 0 array
            yum install mdadm -y
            mdadm --create /dev/md0 --level=0 --name=md0 --raid-devices="$num_drives" "${nvme_drives[@]}"
            mkfs.ext4 /dev/md0
            mount /dev/md0 "$mount_location"
            # Persist the array definition and the mount across reboots
            mdadm --detail --scan >> /etc/mdadm.conf
            echo "/dev/md0 $mount_location ext4 defaults,noatime 0 2" >> /etc/fstab
        else
            # Single NVMe drive detected, format and mount it
            for disk in "${nvme_drives[@]}"
            do
                mkfs.ext4 -F "$disk"
                mount "$disk" "$mount_location"
                echo "$disk $mount_location ext4 defaults,noatime 0 2" >> /etc/fstab
            done
        fi
    else
        # NVMe drives are not detected, skip NVMe-specific commands
        echo "No NVMe drives detected. Skipping NVMe-specific commands."
    fi

NodePool

A single Karpenter NodePool handles various pods, eliminating the need for multiple node groups. Use the manifest below to create a NodePool; it references the EC2NodeClass above, whose securityGroupSelectorTerms and subnetSelectorTerms handle resource discovery. Setting the consolidation policy to WhenEmpty reduces costs by removing empty nodes.

B. Create an e6data Node Pool
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: <NODEPOOL_NAME>
  labels:
    app: e6data
    e6data-workspace-name: <WORKSPACE_NAME>
spec:
  template:
    metadata:
      labels:
        app: e6data
        e6data-workspace-name: <WORKSPACE_NAME>
    spec:
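      requirements:
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          # Extend this list with any further instance families you want to allow
          values: ["m5", "m5d", "c5", "c5d", "c4", "r4"]
        - key: topology.kubernetes.io/zone
          operator: In
          # Extend this list with the remaining Availability Zones you use
          values: ["us-west-2a", "us-west-2b"]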
      nodeClassRef:
        name: <NODECLASS_NAME>
      taints:
        - key: "e6data-workspace-name"
          value: <WORKSPACE_NAME>
          effect: NoSchedule  
  limits:
    cpu: 100000
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
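Because the NodePool above taints its nodes with e6data-workspace-name, only pods that tolerate that taint can be scheduled onto them. The manifest below is a minimal sketch of such a pod, useful for checking that Karpenter provisions a node from this NodePool. The pod name and the pause image are assumptions chosen for illustration; they are not part of the e6data deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nodepool-toleration-check      # hypothetical name, used only for this check
spec:
  # Select nodes carrying the label set in the NodePool template
  nodeSelector:
    e6data-workspace-name: <WORKSPACE_NAME>
  # Tolerate the NoSchedule taint applied by the NodePool
  tolerations:
    - key: "e6data-workspace-name"
      operator: "Equal"
      value: <WORKSPACE_NAME>
      effect: "NoSchedule"
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9  # placeholder workload that does nothing
```

A pod without the toleration stays off these nodes, which is how the taint keeps the NodePool's capacity reserved for the workspace.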

Set up AWS Load Balancer Controller (ALB)

The AWS Load Balancer Controller is required in the EKS Cluster to provide connectivity between the e6data Console and the e6data Cluster, as well as between querying/BI tools and the e6data Query Engine.

Please take note of the following:

  1. Ensure that the AWS Load Balancer Controller is version v2.5 or higher.

  2. Configure the ALB controller with the parameters listed below:

enableShield = false
enableWafv2 = false
enableWaf = false
defaultTags.app = "e6data"
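If the controller is installed with its Helm chart (eks/aws-load-balancer-controller), the parameters above map onto a values file along the following lines. This is a sketch that assumes a Helm-based install; the `clusterName` entry is the chart's own required setting and is not one of the e6data parameters listed above.

```yaml
# values.yaml sketch for the aws-load-balancer-controller Helm chart
clusterName: <EKS_CLUSTER_NAME>   # required by the chart
enableShield: false
enableWafv2: false
enableWaf: false
defaultTags:
  app: e6data                     # applied to AWS resources the controller creates
```

It could then be applied with, for example, `helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system -f values.yaml`.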

To get started with setting up an Amazon Elastic Kubernetes Service (EKS) cluster, follow the AWS documentation: Creating an Amazon EKS cluster.

To enable IAM roles for service accounts and create an OpenID Connect (OIDC) provider for your EKS cluster, refer to the AWS documentation: Creating an IAM OIDC provider for your cluster - Amazon EKS.

To set up Karpenter for your EKS cluster, refer to the official Karpenter documentation.

Use a Restricted IAM Policy: Instead of the broad permissions outlined in the Karpenter documentation for the Karpenter controller OIDC policy, you can use a more restricted IAM policy. e6data provides an example policy that limits permissions to only e6data-managed resources.

To install the AWS Load Balancer Controller in your EKS cluster, follow the steps in the official AWS documentation: Installing the AWS Load Balancer Controller add-on - Amazon EKS.

Use a Restricted IAM Policy: Instead of the broad permissions mentioned in the AWS Load Balancer Controller documentation, you can use a more restricted IAM policy. e6data provides an example policy that limits permissions to only e6data-managed resources.