23rd May 2023

Summary

The 23rd May 2023 release of e6data includes the following features & enhancements:

Usage Notes

To access all features from this release:

  • Suspend and resume any active clusters

Platform Release Notes

WorkspaceAdmin Role

A new default role named WorkspaceAdmin has been added. This role makes it easier to create users who should only have access to manage workspaces and view the catalogs & clusters associated with a workspace.

More information: Roles

Multi-Catalog Support

Users can now select multiple catalogs to connect to a cluster when creating or editing the cluster, instead of adding catalogs individually.

Engine Release Notes

SQL Optimizations

Support for the collect_list Function

e6data now supports the collect_list function. This function enhances your data analysis capabilities by allowing more effective aggregation and windowing of data.

  • The collect_list function enables you to create a list of values from a specified column within a group or window, in their order of appearance.

  • It is particularly useful when you need to gather and process multiple values associated with a particular grouping key or window frame.
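As an illustrative sketch (the `orders` table and its columns are hypothetical placeholders, not part of this release):

```sql
-- Hypothetical table: orders(customer_id, product, order_date)

-- Aggregate form: one list of products per customer
SELECT customer_id,
       COLLECT_LIST(product) AS products
FROM orders
GROUP BY customer_id;

-- Window form: the full list of products repeated on each
-- row of the customer's partition
SELECT customer_id,
       order_date,
       COLLECT_LIST(product) OVER (PARTITION BY customer_id) AS products
FROM orders;
```

Note that the resulting list reflects the order in which values appear in the data; see the limitations below regarding sorting and DISTINCT.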

More information: COLLECT_LIST()

Cross-Schema & Cross-Catalog Querying

We now support cross-schema and cross-catalog querying! This new capability allows you to access and analyze data across different schemas and catalogs within your data lakehouse environment.

  • This feature lets you combine and analyze data from different schemas and catalogs, providing a comprehensive view of your data landscape.

  • To perform cross-schema or cross-catalog queries, you must fully qualify the table path by providing the catalog name, schema name, and table name.

    • Hive Catalog: <catalog_name>.<db_name>.<table_name>, where catalog_name is the name of the catalog specified during catalog creation in e6data

    • Glue Catalog: <catalog_name>.<db_name>.<table_name>, where catalog_name is the name of the catalog specified during catalog creation in e6data

    • Unity Metastore: <unity_catalog_name>.<db_name>.<table_name>, where unity_catalog_name is the name of the catalog created under Unity in Databricks
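For instance, a cross-catalog join might look like the following (the catalog, schema, and table names here are hypothetical placeholders):

```sql
-- Hypothetical: "glue_sales" and "hive_archive" are two catalogs
-- connected to the same cluster, in the same account and region
SELECT o.order_id,
       h.order_status
FROM glue_sales.retail.orders o
INNER JOIN hive_archive.retail.order_history h
    ON o.order_id = h.order_id;
```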

More information: Cross-Catalog & Cross-Schema Querying

Bug Fixes & Improvements

We have added support for querying multi-level partitioned columns in Glue catalogs, which was previously unsupported. This unlocks the full potential of e6data's query planning and optimization when using Glue catalogs.

  • You can now seamlessly query and analyze data stored in multi-level partitioned columns.

  • We utilize the partitioning information to optimize query execution plans. This optimization ensures that queries involving multi-level partitioned columns are processed efficiently, resulting in improved performance.
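As a sketch, assuming a hypothetical Glue table partitioned first by `year` and then by `month`:

```sql
-- Filtering on both partition levels lets the planner prune
-- partitions instead of scanning the whole table
SELECT COUNT(*)
FROM glue_catalog.web.page_views
WHERE year = 2023
  AND month = 5;
```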

Known Limitations

  1. Cross-schema & cross-catalog querying have a few limitations, as listed below:

    1. Slight Increase in Parsing Time: Due to the additional complexity introduced by cross-schema and cross-catalog querying, there may be a slight increase in parsing time for initial queries. However, subsequent queries benefit from query plan caching and performance optimizations.

    2. Fully Qualified Path Requirement: When performing cross-schema or cross-catalog queries, it is essential to provide the entire path, including the catalog name, schema name, and table name. For example: <catalog-name>.<schema-name>.<table-name>

    3. Same Account and Region for Cross-Catalog Queries: To query tables across catalogs, the catalogs must reside within the same account and region. This limitation ensures data security and optimal performance.

  2. COLLECT_LIST() function has a few limitations, as listed below:

    1. Unsupported Complex Data Types: The COLLECT_LIST function currently does not support complex data type columns as input. It is designed to work with simple data types such as integers, strings, or dates.

    2. Absence of Sorting Capability: The COLLECT_LIST function does not allow for sorting operations, such as using the ORDER BY clause. The resulting list is generated based on the order of appearance of the values in the dataset, without the ability to explicitly control the ordering.

    3. Incompatibility with Distinct Operation: The COLLECT_LIST function cannot be used in conjunction with the DISTINCT operation. It is intended to gather all values within a specified grouping or windowing context, without eliminating duplicates or applying distinctness.

    4. Unsupported Union Operation with Window Functionality: The COLLECT_LIST function cannot be directly combined with the UNION operation when used within a window function. Unioning the results of COLLECT_LIST across different window partitions is currently not supported.

    5. Null Values Omitted from the List: When using the COLLECT_LIST function, null values are automatically excluded from the resulting list. The function gathers and returns non-null values only.

  3. Please take note of the following restrictions when working with complex data types:

    1. Struct Limitations:

      1. Nested Structs: Querying the entire struct or individual leaf fields of a nested struct is supported, but querying intermediate levels within a struct is not.

      2. Array Inside a Struct and Struct Inside a Struct: Arrays and structs nested within a struct are supported.

      3. Standard Operations and Functions: Standard operations and functions can be applied to primitive struct fields, allowing you to manipulate and analyze the data. However, operations such as JOIN, GROUP BY, ORDER BY, and UNION ALL involving structs are only supported on leaf fields.

    2. Array Limitations:

      1. Non-Primitive Arrays: Non-primitive arrays, including arrays of arrays, arrays of structs, and nested complex types within arrays, are not supported.

  4. Queries that match the “nested WITH clause with a LIMIT” pattern will fail to parse. An example of such a query:

WITH cte2 AS (
    WITH cte AS (
        SELECT d_year,
               SUM(ss_net_profit) AS ss_net_profit
        FROM store_sales ss
        INNER JOIN date_dim dd
            ON ss.ss_sold_date_sk = dd.d_date_sk
        GROUP BY 1
        ORDER BY 1, 2
    )
    SELECT *
    FROM cte
    ORDER BY 1
    LIMIT 2
)
SELECT d_year,
       AVG(ss_net_profit)
FROM cte2
GROUP BY 1
ORDER BY 1
