Setup using Terraform in GCP

Deploying an e6data Workspace in GCP using Terraform

Please sign up for an e6data account before creating workspaces.

Once logged into the e6data platform, it’s time to configure an e6data workspace in GCP. We keep it simple - all you need is an existing GCP account with the prerequisites listed below:

Prerequisites

Create the e6data Workspace

  1. Log in to the e6data Console.

  2. Navigate to Workspaces in the left-side navigation bar or click Create Workspace.

  3. Select GCP as the Cloud Provider.

  4. Proceed to the next section to deploy the Workspace.

Installing the e6data Workspace

The e6data Workspace is deployed inside a GKE cluster using a Terraform script. The following sections explain how to prepare the two Terraform files required for the deployment: provider.tf and terraform.tfvars.

If a GKE cluster is not available, please follow the instructions in the Creating a GKE Cluster (optional) section to create one.

If Terraform is not installed, please follow the instructions in the Installing Terraform Locally (optional) section.

Download e6data Terraform Scripts

Please download or clone the e6x-labs/terraform repo, which contains the e6data Workspace deployment Terraform scripts, from GitHub.

Configure provider.tf

The google provider blocks are used to configure the credentials you use to authenticate with GCP, as well as the default project and location of your resources.

Extract the scripts downloaded in the previous step and navigate to the ./scripts/gcp/terraform folder.

Edit the provider.tf file according to your requirements. Please refer to the official Terraform documentation to find instructions to use the authentication method most appropriate to your environment.

Sample provider.tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "4.72.0"
    }
  }

  # The backend block belongs inside the terraform block, not the provider block.
  backend "gcs" {
    bucket = "<bucket_name_to_store_the_tfstate_file>"
    prefix = "terraform/state"
  }
}

provider "google" {
  project     = var.gcp_project_id
  region      = var.gcp_region
  credentials = "{{GOOGLE_CLOUD_KEYFILE_JSON}}"
  # access_token = "{{ gcp_access_token }}"
}

Specifying Google Cloud Storage (GCS) Bucket for Terraform State file

To store the Terraform state in a Google Cloud Storage (GCS) bucket when using Google Cloud Platform (GCP) as the provider, configure a gcs backend in provider.tf.

Replace <bucket_name_to_store_the_tfstate_file> with the name of the GCS bucket that you want to use.

Additionally, the prefix parameter allows you to specify a directory or path within the bucket where the Terraform state file will be stored. Adjust the prefix value according to your requirements.

  • Make sure that the credentials used by Terraform have the necessary permissions to read from and write to the specified GCS bucket.

  • For more information and to explore additional backend options, you can refer to the Terraform Backend Configuration documentation.
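The backend settings described in this section can be sketched as follows. This is a minimal illustration, not the exact contents of the e6data repository: the bucket name my-e6data-tfstate is a hypothetical placeholder, and the bucket is assumed to exist already. Terraform expects the backend block to be nested inside the terraform block.

```hcl
terraform {
  backend "gcs" {
    # Hypothetical bucket name; replace with an existing bucket you control.
    bucket = "my-e6data-tfstate"

    # Path inside the bucket under which the state object is written.
    prefix = "terraform/state"
  }
}
```

With these values and the default Terraform workspace, the state object would be written under the terraform/state prefix of that bucket. After changing backend settings, run terraform init again so that Terraform picks up the new backend.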

Configuration Variables in terraform.tfvars File

The terraform.tfvars file contains the variables that need to be configured before executing the Terraform script.

Edit the following variables in the terraform.tfvars file according to your needs:

Sample terraform.tfvars
# GCP Variables
gcp_region                          = "us-central1"                          # The region where the e6data resources will be created
gcp_project_id                      = "<project_ID>"                          # The ID of the GCP project

# e6data Workspace Variables
workspace_name                      = "workspace"                             # The name of the e6data workspace
# Note: The variable workspace_name should meet the following criteria:
# a) Only lowercase alphanumeric characters are accepted.
# b) Must have a minimum of 3 characters.

helm_chart_version                  = "2.0.5"                               ### e6data workspace Helm chart version to be used.

# Network Variables
gke_subnet_ip_cidr_range            = "10.100.0.0/18"                        # The subnet IP range for GKE

gke_e6data_master_ipv4_cidr_block   = "10.103.4.0/28" 
# The IP range in CIDR notation to use for the hosted master network
# This range will be used for assigning private IP addresses to the cluster master(s) and the ILB VIP
# This range must not overlap with any other ranges in use within the cluster's network, and it must be a /28 subnet

# Kubernetes Variables
gke_version                         = "1.28.8"                               # The version of GKE to use                
gke_encryption_state                = "DECRYPTED"                            # The encryption state for GKE (It is recommended to use encryption)
gke_dns_cache_enabled               = true                                   # The status of the NodeLocal DNSCache addon.
spot_enabled                        = true                                   # A boolean that represents whether the underlying node VMs are spot.

# GKE Cluster variables
cluster_name                        = "gke-cluster-name"                      # The name of the GKE cluster
default_nodepool_instance_type      = "e2-standard-2"                        # The default instance type for the node pool

gke_e6data_initial_node_count       = 1                                      # The initial number of nodes in the GKE cluster
gke_e6data_max_pods_per_node        = 64                                      # The maximum number of pods per node in the GKE cluster
gke_e6data_instance_type            = "c2-standard-30"                       # The instance type for the GKE nodes
max_instances_in_nodepool          = 50                                      # The maximum number of instances in a node group

# Kubernetes Namespace
kubernetes_namespace                = "namespace"                            # The namespace to use for Kubernetes resources

# Cost Labels
cost_labels = {}                            # Cost labels for tracking costs
# Note: The variable cost_labels only accepts lowercase letters ([a-z]), numeric characters ([0-9]), underscores (_) and dashes (-).

buckets                             = ["*"]                                 # List of bucket names that the e6data engine queries and therefore requires read access to. The default is ["*"], which means all buckets; it is advisable to change this.
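As an illustration of the last two settings in the sample, the fragment below shows what non-default values might look like. It is a sketch under assumptions: cost_labels is assumed to be a map of string keys to string values, and the label keys, label values, and bucket names are hypothetical.

```hcl
# Hypothetical labels; only lowercase letters, digits, underscores, and dashes are used.
cost_labels = {
  team        = "analytics"
  environment = "dev"
}

# Hypothetical bucket names; restricts read access to these buckets instead of ["*"].
buckets = ["example-sales-data", "example-clickstream-data"]
```

Listing buckets explicitly limits the read access granted to the e6data engine to the data it actually needs to query.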

Please update the values of these variables in the terraform.tfvars file to match the specific configuration details for your environment:

workspace_name

The name of the e6data workspace to be created.

gcp_project_id

The Google Cloud Platform (GCP) project ID to deploy the e6data workspace.

gcp_region

The GCP region to run the e6data workspace.

helm_chart_version

e6data workspace Helm chart version to be used.

gke_subnet_ip_cidr_range

The subnet IP range for GKE.

cluster_name

The name of the Kubernetes cluster for e6data.

gke_e6data_master_ipv4_cidr_block

The IP range in CIDR notation to use for the hosted master network.

gke_version

The version of GKE to use.

gke_encryption_state

The encryption state for GKE (it is recommended to use encryption).

gke_dns_cache_enabled

The status of the NodeLocal DNSCache addon.

spot_enabled

A boolean that represents whether the underlying node VMs are spot.

kubernetes_cluster_zone

The Kubernetes cluster zone (only required for zonal clusters).

max_instances_in_nodepool

The maximum number of instances in a nodepool.

default_nodepool_instance_type

The default instance type for the node pool.

gke_e6data_initial_node_count

The initial number of nodes in the GKE cluster.

gke_e6data_max_pods_per_node

The maximum number of pods per node in the GKE cluster.

gke_e6data_instance_type

The instance type for the GKE nodes.

kubernetes_namespace

The Kubernetes namespace to deploy the e6data workspace.

cost_labels

Cost labels for tracking costs.

buckets

List of bucket names that the e6data engine queries and therefore requires read access to. The default is ["*"], which means all buckets; it is advisable to change this.
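The naming and CIDR rules mentioned in the sample file (workspace_name must be at least 3 lowercase alphanumeric characters, and the master CIDR block must be a /28) could be enforced with Terraform variable validation. The following is a sketch only, not the repository's actual variables.tf. If these variables are already declared there, merge the validation blocks into the existing declarations rather than declaring the variables a second time.

```hcl
variable "workspace_name" {
  type        = string
  description = "The name of the e6data workspace."

  validation {
    # Lowercase letters and digits only, at least 3 characters.
    condition     = can(regex("^[a-z0-9]{3,}$", var.workspace_name))
    error_message = "The workspace_name value must contain at least 3 characters, using only lowercase letters and digits."
  }
}

variable "gke_e6data_master_ipv4_cidr_block" {
  type        = string
  description = "The /28 CIDR range for the hosted master network."

  validation {
    # Must parse as a CIDR range and end in /28.
    condition = (
      can(cidrhost(var.gke_e6data_master_ipv4_cidr_block, 0)) &&
      can(regex("/28$", var.gke_e6data_master_ipv4_cidr_block))
    )
    error_message = "The master CIDR block must be a valid /28 range, for example 10.103.4.0/28."
  }
}
```

With checks like these, terraform plan fails early with a readable message instead of failing partway through cluster creation.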

Execution Commands

Once you have configured the necessary variables in the provider.tf and terraform.tfvars files, you can proceed with the deployment of the e6data workspace. Follow the steps below to initiate the deployment:

  1. Navigate to the directory containing the Terraform files. It is essential to be in the correct directory for the Terraform commands to execute successfully.

  2. Initialize Terraform: terraform init

  3. Generate a Terraform plan and save it to a file (e.g. e6.plan): terraform plan -var-file="terraform.tfvars" -out="e6.plan"

    • The -var-file flag specifies the input variable file (terraform.tfvars) that contains the necessary configuration values for the deployment.

  4. Review the generated plan.

  5. Apply the changes using the generated plan file: terraform apply "e6.plan"

    • This command applies the changes specified in the plan file (e6.plan) to deploy the e6data workspace in your environment.

  6. Make note of the values returned by the script.

  7. Return to the e6data Console and enter the values returned in the previous step.

Deployment Overview and Resource Provisioning

This section provides a comprehensive overview of the resources deployed using the Terraform script for the e6data workspace deployment.

  • Only the e6data engine, which resides within the customer account, has permission to access data stores.

  • The cross-account role does not have access to data stores, so the e6data platform cannot access them.

Permissions

Engine Permissions

The e6data Engine, which is deployed inside the customer boundary, requires the following permissions:

Read-only access to buckets containing the data to be queried:

permissions = [
    "storage.objects.getIamPolicy",
    "storage.objects.get",
    "storage.objects.list",
]

Read-write access to a bucket created by e6data to store query results, logs, etc.:

permissions = [
    "storage.objects.setIamPolicy",
    "storage.objects.getIamPolicy",
    "storage.objects.update",
    "storage.objects.create",
    "storage.objects.delete",
    "storage.objects.get",
    "storage.objects.list",
]
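A permission list like the read-only one above is typically attached through a custom IAM role. The sketch below uses the google_project_iam_custom_role resource from the Google provider. The resource name, role_id, and title are hypothetical and may not match what the e6data scripts actually create.

```hcl
# Hypothetical custom role carrying the read-only bucket permissions.
resource "google_project_iam_custom_role" "e6data_read_example" {
  project = var.gcp_project_id
  role_id = "e6dataReadExample"
  title   = "e6data read-only bucket access (example)"

  permissions = [
    "storage.objects.getIamPolicy",
    "storage.objects.get",
    "storage.objects.list",
  ]
}
```

The read-write role for the e6data-managed bucket would follow the same pattern with the longer permission list.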

Permissions for connectivity with the e6data control plane

The workloadIdentityUser role requires the following permissions for authentication and interaction with the cluster:

permissions  = [
    "iam.serviceAccounts.get",
    "iam.serviceAccounts.getAccessToken",
    "iam.serviceAccounts.getOpenIdToken",
    "iam.serviceAccounts.list"
]

The e6dataclusterViewer role requires the following permissions to monitor e6data cluster health:

permissions  = [
    "container.clusters.get",
    "container.clusters.list",
    "container.roleBindings.get",
    "container.backendConfigs.get",
    "container.backendConfigs.create",
    "container.backendConfigs.delete",
    "container.backendConfigs.update",
    "resourcemanager.projects.get",
    "compute.sslCertificates.get",
    "compute.forwardingRules.list"
]

The globalAddresses role requires the following permissions to manage global addresses:

permissions = [
    "compute.globalAddresses.create",
    "compute.globalAddresses.delete",
    "compute.globalAddresses.get",
    "compute.globalAddresses.setLabels"
]

The securityPolicies role requires the following permissions to manage security policies:

permissions = [
    "compute.securityPolicies.create",
    "compute.securityPolicies.get",
    "compute.securityPolicies.delete",
    "compute.securityPolicies.update"
]

The targetPools role requires the following permissions to get the instance list for a target pool or backend config:

permissions = [
    "compute.instances.get",
    "compute.targetPools.get",
    "compute.targetPools.list"
]

Resources Created

  1. This Terraform configuration sets up a network, subnetwork, router, and NAT gateway on Google Cloud Platform (GCP). The network is created without auto-creating subnetworks. The subnetwork is configured with a specified IP CIDR range, region, and enables private Google access. VPC flow logs can be configured for the subnetwork. A router is created and attached to the network. A Cloud NAT gateway is provisioned for internet access for private nodes, with automatic IP allocation and NAT configuration for all subnetwork IP ranges.

  2. This Terraform configuration sets up a private and regional Google Kubernetes Engine (GKE) cluster. The cluster is configured with a specified name, region, minimum master version, monitoring and logging services, network, subnetwork, and initial node count. Vertical Pod Autoscaling is enabled, and workload identity is configured.

  3. The cluster's private configuration includes private nodes, a master IPv4 CIDR block, and a disabled private endpoint. IP allocation policy, HTTP load balancing, and DNS cache configuration are also defined. Resource labels and master authorized networks are configured, and a lifecycle block specifies that a replacement cluster is created before any existing one is destroyed, to reduce downtime.

  4. GKE Node Pool for Workspace: Provides a dedicated node pool in GKE to host the e6data workspace, with autoscaling and location policies for scalability and performance.

  5. GCS Bucket for Query Results: Establishes a dedicated Google Cloud Storage (GCS) bucket within the e6data workspace to store query results.

  6. Service Account for Workspace: The deployment includes creating a dedicated service account that ensures secure access to e6data workspace resources. This service account is assigned a custom role that grants read access to GCS buckets, enabling the e6data engine to retrieve data for querying and processing operations. Additionally, the service account is also provided with read/write access to the e6data workspace bucket. This access allows the e6data platform to write query results to the bucket, providing efficient storage and management of workspace-related data.

  7. IAM Policy Bindings for Workspace Workload Identity: This creates an IAM policy binding, enabling the workspace service account to act as a workload identity user in the Kubernetes cluster. This binding grants the necessary permissions for authentication and interaction within the cluster, facilitating seamless integration between the e6data workspace and Kubernetes infrastructure.

  8. IAM Policy Bindings for Platform Workload Identity:

    This creates an IAM policy binding between the platform service account and the Kubernetes cluster, granting the platform service account the "roles/container.clusterViewer" role. This role provides the necessary permissions to view and access the Kubernetes cluster.

  9. Helm Release: The Helm release in the provided Terraform code provisions and assigns cluster roles to the e6data control plane user. These cluster roles grant specific permissions and access within the Kubernetes cluster to the e6data control plane user. The defined permissions include the ability to manage various resources such as pods, nodes, services, ingresses, configmaps, secrets, jobs, deployments, daemonsets, statefulsets, and replicasets. By deploying these cluster roles through the Helm release, the e6data control plane user is equipped with the necessary permissions to effectively manage and interact with the resources within the cluster, enabling seamless operation and configuration of the e6data platform.
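The networking resources described in item 1 of the list above can be sketched as follows. This is an illustrative approximation rather than the repository's actual code: all resource names are hypothetical, and only the settings mentioned in the text are shown (no auto-created subnetworks, private Google access, a router, and a Cloud NAT with automatic IP allocation covering all subnetwork ranges).

```hcl
# Custom-mode VPC: subnetworks are not created automatically.
resource "google_compute_network" "example" {
  name                    = "e6data-example-network"
  auto_create_subnetworks = false
}

# Subnetwork with the configured CIDR range and private Google access.
resource "google_compute_subnetwork" "example" {
  name                     = "e6data-example-subnet"
  region                   = var.gcp_region
  network                  = google_compute_network.example.id
  ip_cidr_range            = var.gke_subnet_ip_cidr_range
  private_ip_google_access = true
}

# Cloud Router attached to the network.
resource "google_compute_router" "example" {
  name    = "e6data-example-router"
  region  = var.gcp_region
  network = google_compute_network.example.id
}

# Cloud NAT giving private nodes outbound internet access.
resource "google_compute_router_nat" "example" {
  name                               = "e6data-example-nat"
  router                             = google_compute_router.example.name
  region                             = var.gcp_region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}
```

Private GKE nodes have no external IP addresses, so the NAT gateway is what lets them reach endpoints outside Google's network.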

Creating a GKE Cluster (optional)

To create a new Google Kubernetes Engine (GKE) cluster, you'll need to have the Google Cloud SDK installed and configured on your local machine. If you don't have it installed, you can follow the "Install the gcloud CLI" instructions in the Google Cloud documentation to set it up.

Once you have the Google Cloud SDK ready, follow these steps to create a new GKE cluster:

  1. Open a terminal or command prompt.

  2. Authenticate with your Google Cloud account by running the following command and following the instructions:

    gcloud auth login

  3. Set the default project for your cluster by running the following command and replacing [PROJECT_ID] with your Google Cloud project ID:

    gcloud config set project [PROJECT_ID]

  4. Run the following command, replacing [CLUSTER_NAME] with your desired name for the GKE cluster and [REGION] with the target region:

    gcloud container clusters create [CLUSTER_NAME] --region [REGION] --num-nodes=1 --workload-pool=[PROJECT_ID].svc.id.goog

    This command creates a GKE cluster with the specified name and configuration.

    • The --num-nodes flag specifies the desired number of default nodes in the cluster.

    • The --workload-pool flag enables Workload Identity for the cluster by specifying the workload pool to use. Replace [PROJECT_ID] with your actual project ID. The project ID is a unique identifier for your Google Cloud project.

    Please note that the cluster creation process may take some time as Google Cloud provisions the necessary resources and sets up the cluster.

For detailed instructions and more advanced configurations, you can refer to the gcloud container clusters create reference in the official Google Cloud CLI documentation.

  5. Connecting to the GKE Cluster and Installing kubectl

    After the cluster creation is complete, run the following command to retrieve the cluster credentials and configure kubectl to connect to the GKE cluster:

gcloud container clusters get-credentials [CLUSTER_NAME] --region=[REGION]

  • Replace [CLUSTER_NAME] with the name of your GKE cluster and [REGION] with the region where the cluster is located.

  • This command fetches the necessary cluster credentials and configures kubectl to connect to the GKE cluster.

  • Verify the connection to the GKE cluster by running the following command:

kubectl get nodes

This command lists the nodes in your GKE cluster. If the connection is successful, you should see a list of nodes displayed.

  • If you don't have kubectl installed, follow the official Kubernetes documentation to install kubectl on your local machine: Install kubectl

  • Once kubectl is installed, you can verify the installation by running:

kubectl version --client

This command displays the version information of kubectl, indicating that it is successfully installed.

Congratulations! You have now created a GKE cluster. You can now deploy and manage the e6data workspace on this cluster using the Terraform scripts described earlier on this page.

Installing Terraform Locally (optional)

To install Terraform on your local machine, follow these steps, adapted from the official HashiCorp Terraform documentation:

  1. Visit the official Terraform website (Terraform by HashiCorp).

  2. Navigate to the "Downloads" page.

  3. Download the appropriate package for your operating system (Windows, macOS, Linux).

  4. Extract the downloaded package to a directory of your choice.

  5. Add the Terraform executable to your system's PATH environment variable.

    • For Windows:

      1. Open the Start menu and search for "Environment Variables."

      2. Select "Edit the system environment variables."

      3. Click the "Environment Variables" button.

      4. Under "System variables," find the "Path" variable and click "Edit."

      5. Add the path to the directory where you extracted the Terraform executable (e.g., C:\terraform) to the list of paths.

      6. Click "OK" to save the changes.

    • For macOS and Linux:

      1. Open a terminal.

      2. Run the following command, replacing <path_to_extracted_binary> with the path to the directory where you extracted the Terraform executable: export PATH=$PATH:<path_to_extracted_binary>

      3. Optionally, you can add this command to your shell's profile file (e.g., ~/.bash_profile, ~/.bashrc, ~/.zshrc) to make it persistent across terminal sessions.

  6. Verify the installation by opening a new terminal window and running the following command: terraform version. If Terraform is installed correctly, you should see the version number displayed.
