Hadoop-to-Snowflake modernization enabled a single source of truth
Migrated 800+ production-grade Hive scripts with 75% automation in less than 2 months
Challenge
The existing data warehouse of a global specialty retail company was struggling to keep up with growing data volumes. In addition, workloads spread across multiple technologies, such as Hive scripts, BTEQs, DSS jobs, Talend jobs, and Spark, had resulted in a data swamp with duplicate entries. To address these challenges, the retail company wanted to modernize their Hadoop data platform to Snowflake on Azure. They were looking to:
- Reduce OpEx
- Save data storage and analytics costs
- Improve scalability
- Create a unified source of truth for easy access and faster analytics
However, migrating diverse Hadoop ecosystem services can be complex, time-consuming, and risky. Therefore, the retailer was looking for a partner to automate modernization and help:
- Migrate 800+ Hive scripts to Snowflake as Python, Snow blocks, and XML-compatible SQLs
- Assess and migrate all legacy Hive DDL, DML, and queries to equivalent Snowflake workloads
- Update column names and aliases in the target tables with generic names for easy sorting
- Identify and remove duplicate entries
- Incorporate data model changes and schema modernization
- Provide transparent access to downstream users consuming Snowflake data
- Ensure a smooth transition from Hadoop to Snowflake with seamless integration
Solution
Impetus partnered with the retailer to ensure seamless, automated migration of 800+ production-grade Hive scripts containing 1600+ DDL and DQL queries to Snowflake.
Assessment
To ensure risk-free accelerated migration, we assessed the entire inventory of legacy workloads:
- Analyzed infrastructure, data and storage, and memory utilization variance
- Forecasted capacity and mapped the services for the target
- Classified workloads based on service mappings and identified candidates to rehost, refactor, or rebuild
- Provided recommendations based on matching capacity or burst capacity
The assessment revealed ~40% simple scripts (which could be migrated with minor changes in column names and syntax), 30% scripts with medium complexity, and 30% highly complex scripts.
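To give a sense of how such an inventory scan can be scripted, the sketch below buckets Hive scripts by migration complexity using simple pattern heuristics. The patterns, thresholds, and .hql directory layout are illustrative assumptions, not LeapLogic's actual assessment logic.

```python
import re
from pathlib import Path

# Illustrative heuristics only: Hive-specific constructs push a script toward
# "complex"; a high count of joins/unions/window functions marks it "medium".
COMPLEX_PATTERNS = [r"\bLATERAL\s+VIEW\b", r"\bTRANSFORM\s*\(", r"\bDISTRIBUTE\s+BY\b"]
MEDIUM_PATTERNS = [r"\bJOIN\b", r"\bUNION\b", r"\bCASE\b", r"\bOVER\s*\("]

def classify(script_text: str) -> str:
    """Return 'simple', 'medium', or 'complex' for a single Hive script."""
    text = script_text.upper()
    if any(re.search(p, text) for p in COMPLEX_PATTERNS):
        return "complex"
    if sum(len(re.findall(p, text)) for p in MEDIUM_PATTERNS) > 3:
        return "medium"
    return "simple"

def summarize(script_dir: str) -> dict:
    """Tally complexity buckets across a directory of .hql scripts."""
    counts = {"simple": 0, "medium": 0, "complex": 0}
    for path in Path(script_dir).glob("*.hql"):
        counts[classify(path.read_text())] += 1
    return counts
```

A tally of this kind is what lets a migration team size the effort up front and plan automation coverage per bucket.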
Transformation
The team used LeapLogic’s ingestion framework to ingest data from Hive to Snowflake automatically. For conversion, we used LeapLogic frameworks like:
- Apache Calcite framework for conversion of Hive to SnowSQL
- XML generation framework to convert SQLs to XMLs
- XML execution framework to automatically execute all XML-converted queries and jobs
- Custom framework to execute the SQL queries
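To make the Hive-to-Snowflake SQL conversion concrete, here is a minimal sketch of the kind of dialect rewriting involved: mapping Hive types to Snowflake types and dropping Hive-only DDL clauses. It is an illustrative stand-in, not the Calcite-based LeapLogic framework, and the mappings shown are a small, assumed subset.

```python
import re

# Illustrative Hive -> Snowflake type mappings (not exhaustive).
TYPE_MAP = {
    r"\bSTRING\b": "VARCHAR",
    r"\bTINYINT\b": "SMALLINT",
    r"\bDOUBLE\b": "DOUBLE PRECISION",
}

# Hive-only clauses with no direct equivalent in a Snowflake CREATE TABLE.
HIVE_ONLY_CLAUSES = [
    r"PARTITIONED\s+BY\s*\([^)]*\)",
    r"CLUSTERED\s+BY\s*\([^)]*\)\s*INTO\s+\d+\s+BUCKETS",
    r"ROW\s+FORMAT\s+DELIMITED[^;]*?(?=STORED|TBLPROPERTIES|;)",
    r"STORED\s+AS\s+\w+",
    r"TBLPROPERTIES\s*\([^)]*\)",
    r"LOCATION\s+'[^']*'",
]

def hive_ddl_to_snowflake(ddl: str) -> str:
    """Rewrite a Hive CREATE TABLE statement into Snowflake-compatible SQL."""
    sql = ddl
    for pattern in HIVE_ONLY_CLAUSES:
        sql = re.sub(pattern, "", sql, flags=re.IGNORECASE | re.DOTALL)
    for hive_type, sf_type in TYPE_MAP.items():
        sql = re.sub(hive_type, sf_type, sql, flags=re.IGNORECASE)
    sql = sql.replace("CREATE EXTERNAL TABLE", "CREATE TABLE")
    return re.sub(r"\s+", " ", sql).strip()

print(hive_ddl_to_snowflake(
    "CREATE EXTERNAL TABLE sales (id INT, region STRING, amount DOUBLE) "
    "PARTITIONED BY (dt STRING) STORED AS PARQUET LOCATION '/data/sales';"
))
# CREATE TABLE sales (id INT, region VARCHAR, amount DOUBLE PRECISION) ;
```

Production-grade converters parse the SQL into an abstract syntax tree (which is what Apache Calcite provides) rather than relying on regex rewrites, so they can handle nested queries, UDFs, and dialect-specific functions safely.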
In addition, we also created and leveraged various utilities:
- Created a utility that automatically generates XMLs of Hive scripts
- Used a Python utility to automatically update column names, DDL scripts, aliases, and schema names across 200+ columns with 90% accuracy
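The column-rename step can be imagined along the lines of the sketch below, which applies an old-to-new name mapping across a directory of SQL scripts. The CSV mapping file, its column headers, and the whole-word regex replacement are assumptions for illustration, not the actual utility used on the engagement.

```python
import csv
import re
from pathlib import Path

def load_rename_map(csv_path: str) -> dict:
    """Load an old_name -> new_name mapping from a two-column CSV file."""
    with open(csv_path, newline="") as f:
        return {row["old_name"]: row["new_name"] for row in csv.DictReader(f)}

def rename_columns(sql_text: str, rename_map: dict) -> str:
    """Replace whole-word column references (names and aliases) in a SQL script."""
    for old, new in rename_map.items():
        sql_text = re.sub(rf"\b{re.escape(old)}\b", new, sql_text, flags=re.IGNORECASE)
    return sql_text

def rewrite_scripts(script_dir: str, mapping_csv: str) -> None:
    """Apply the rename map in place to every .sql script in a directory."""
    rename_map = load_rename_map(mapping_csv)
    for path in Path(script_dir).glob("*.sql"):
        path.write_text(rename_columns(path.read_text(), rename_map))
```

A purely textual approach like this one can miss names inside string literals or collide with keywords, which is why automated renames are typically paired with targeted manual review.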
Validation and consolidation
Next, we consolidated tables from the enterprise data warehouse track and incorporated them in MapR. For cell-level validation, we used LeapLogic's validation framework, which saved 75% of the manual effort. Where mismatches occurred, we identified the primary keys (extremely challenging, as Hive tables do not enforce primary keys) and addressed them to ensure a highly accurate conversion, with fewer than 10 defects across 800+ scripts.
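A minimal sketch of cell-level reconciliation is shown below, assuming CSV extracts of the same table from Hive and Snowflake. It infers a candidate key by testing column combinations for uniqueness (since Hive does not enforce primary keys) and then reports every mismatched cell; the file names and the pandas-based approach are illustrative assumptions, not LeapLogic's validation framework.

```python
from itertools import combinations
import pandas as pd

def infer_candidate_key(df: pd.DataFrame, max_cols: int = 3):
    """Find the smallest column combination that uniquely identifies every row."""
    for size in range(1, max_cols + 1):
        for cols in combinations(df.columns, size):
            if not df.duplicated(subset=list(cols)).any():
                return list(cols)
    return None

def cell_level_diff(source: pd.DataFrame, target: pd.DataFrame) -> pd.DataFrame:
    """Align source and target extracts on an inferred key and list mismatched cells."""
    key = infer_candidate_key(source)
    if key is None:
        raise ValueError("No candidate key found; fall back to full-row hashing.")
    merged = source.merge(target, on=key, suffixes=("_src", "_tgt"))
    mismatches = []
    for col in (c for c in source.columns if c not in key):
        diff = merged[merged[f"{col}_src"] != merged[f"{col}_tgt"]]
        for _, row in diff.iterrows():
            mismatches.append({**{k: row[k] for k in key}, "column": col,
                               "source_value": row[f"{col}_src"],
                               "target_value": row[f"{col}_tgt"]})
    return pd.DataFrame(mismatches)

# Hypothetical usage with table extracts from both platforms:
# report = cell_level_diff(pd.read_csv("hive_orders.csv"), pd.read_csv("snowflake_orders.csv"))
```

At the scale of 800+ scripts, such comparisons would normally run inside the platforms themselves (for example via row counts, checksums, and sampled cell diffs) rather than in memory, but the key-inference and cell-diff ideas are the same.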
Impact
The solution provided a significantly automated low-cost Hadoop migration pathway for all legacy dependencies. Teradata, DataStage, and Hadoop workloads were consolidated and migrated to Snowflake on Azure to provide a single source of truth.
The transformation enabled the retailer to retire their Hadoop infrastructure seamlessly and migrate to a highly scalable and performant environment on Snowflake. It also helped them consolidate foundational data and create a unified source of truth with easy access and advanced analytics capabilities.