Hadoop-to-Snowflake modernization enabled a single source of truth
Migrated 800+ production-grade Hive scripts with 75% automation in less than 2 months
Challenge
The existing data warehouse of a global specialty retail company was struggling to keep up with growing data volumes. In addition, workloads spread across multiple technologies, such as Hive scripts, BTEQs, DSS jobs, Talend jobs, and Spark, had resulted in a data swamp with duplicate entries. To address these challenges, the retail company wanted to modernize their Hadoop data platform to Snowflake on Azure. They were looking to:
- Reduce OpEx
- Save data storage and analytics costs
- Improve scalability
- Create a unified source of truth for easy access and faster analytics
However, migrating diverse Hadoop ecosystem services can be complex, time-consuming, and risky. Therefore, the retailer was looking for a partner to automate modernization and help:
- Migrate 800+ Hive scripts to Snowflake as Python, Snow blocks, and XML-compatible SQLs
- Assess and migrate all legacy Hive DDL, DML, and queries to equivalent Snowflake workloads
- Update column names and aliases in the target tables with generic names for easy sorting
- Identify and remove duplicate entries
- Incorporate data model changes and schema modernization
- Provide transparent access to downstream users consuming Snowflake data
- Ensure a smooth transition from Hadoop to Snowflake with seamless integration
Solution
Impetus partnered with the retailer to ensure seamless, automated migration of 800+ production-grade Hive scripts containing 1600+ DDL and DQL queries to Snowflake.
Assessment
To ensure risk-free accelerated migration, we assessed the entire inventory of legacy workloads:
- Analyzed infrastructure, data and storage, and memory utilization variance
- Forecasted capacity and mapped the services for the target
- Classified workloads based on service mappings and identified candidates to rehost, refactor, or rebuild
- Provided recommendations based on matching capacity or burst capacity
The assessment revealed ~40% simple scripts (which could be migrated with minor changes in column names and syntax), 30% scripts with medium complexity, and 30% highly complex scripts.
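To give a sense of how such an inventory scan can be scripted, the sketch below buckets Hive scripts by migration complexity using simple pattern heuristics. The patterns, thresholds, and .hql directory layout are illustrative assumptions, not LeapLogic's actual assessment logic.

```python
import re
from pathlib import Path

# Illustrative heuristics only: Hive-specific constructs push a script toward
# "complex"; a high count of joins/unions/window functions marks it "medium".
COMPLEX_PATTERNS = [r"\bLATERAL\s+VIEW\b", r"\bTRANSFORM\s*\(", r"\bDISTRIBUTE\s+BY\b"]
MEDIUM_PATTERNS = [r"\bJOIN\b", r"\bUNION\b", r"\bCASE\b", r"\bOVER\s*\("]

def classify(script_text: str) -> str:
    """Return 'simple', 'medium', or 'complex' for a single Hive script."""
    text = script_text.upper()
    if any(re.search(p, text) for p in COMPLEX_PATTERNS):
        return "complex"
    if sum(len(re.findall(p, text)) for p in MEDIUM_PATTERNS) > 3:
        return "medium"
    return "simple"

def summarize(script_dir: str) -> dict:
    """Tally complexity buckets across a directory of .hql scripts."""
    counts = {"simple": 0, "medium": 0, "complex": 0}
    for path in Path(script_dir).glob("*.hql"):
        counts[classify(path.read_text())] += 1
    return counts
```

A tally of this kind is what lets a migration team size the effort up front and plan automation coverage per bucket.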
Transformation
The team used LeapLogic’s ingestion framework to ingest data from Hive to Snowflake automatically. For conversion, we used LeapLogic frameworks like:
- Apache Calcite framework for conversion of Hive to SnowSQL
- XML generation framework to convert SQLs to XMLs
- XML execution framework to automatically execute all XML-converted queries and jobs
- Custom framework to execute the SQL queries
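To make the Hive-to-Snowflake SQL conversion concrete, here is a minimal sketch of the kind of dialect rewriting involved: mapping Hive types to Snowflake types and dropping Hive-only DDL clauses. It is an illustrative stand-in, not the Calcite-based LeapLogic framework, and the mappings shown are a small, assumed subset.

```python
import re

# Illustrative Hive -> Snowflake type mappings (not exhaustive).
TYPE_MAP = {
    r"\bSTRING\b": "VARCHAR",
    r"\bTINYINT\b": "SMALLINT",
    r"\bDOUBLE\b": "DOUBLE PRECISION",
}

# Hive-only clauses with no direct equivalent in a Snowflake CREATE TABLE.
HIVE_ONLY_CLAUSES = [
    r"PARTITIONED\s+BY\s*\([^)]*\)",
    r"CLUSTERED\s+BY\s*\([^)]*\)\s*INTO\s+\d+\s+BUCKETS",
    r"ROW\s+FORMAT\s+DELIMITED[^;]*?(?=STORED|TBLPROPERTIES|;)",
    r"STORED\s+AS\s+\w+",
    r"TBLPROPERTIES\s*\([^)]*\)",
    r"LOCATION\s+'[^']*'",
]

def hive_ddl_to_snowflake(ddl: str) -> str:
    """Rewrite a Hive CREATE TABLE statement into Snowflake-compatible SQL."""
    sql = ddl
    for pattern in HIVE_ONLY_CLAUSES:
        sql = re.sub(pattern, "", sql, flags=re.IGNORECASE | re.DOTALL)
    for hive_type, sf_type in TYPE_MAP.items():
        sql = re.sub(hive_type, sf_type, sql, flags=re.IGNORECASE)
    sql = sql.replace("CREATE EXTERNAL TABLE", "CREATE TABLE")
    return re.sub(r"\s+", " ", sql).strip()

print(hive_ddl_to_snowflake(
    "CREATE EXTERNAL TABLE sales (id INT, region STRING, amount DOUBLE) "
    "PARTITIONED BY (dt STRING) STORED AS PARQUET LOCATION '/data/sales';"
))
# CREATE TABLE sales (id INT, region VARCHAR, amount DOUBLE PRECISION) ;
```

Production-grade converters parse the SQL into an abstract syntax tree (which is what Apache Calcite provides) rather than relying on regex rewrites, so they can handle nested queries, UDFs, and dialect-specific functions safely.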
In addition, we also created and leveraged various utilities:
- Created a utility that automatically generates XMLs of Hive scripts
- Used a Python utility to automatically update column names, DDL scripts, aliases, and schema names across 200+ columns with 90% accuracy
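The column-rename step can be imagined along the lines of the sketch below, which applies an old-to-new name mapping across a directory of SQL scripts. The CSV mapping file, its column headers, and the whole-word regex replacement are assumptions for illustration, not the actual utility used on the engagement.

```python
import csv
import re
from pathlib import Path

def load_rename_map(csv_path: str) -> dict:
    """Load an old_name -> new_name mapping from a two-column CSV file."""
    with open(csv_path, newline="") as f:
        return {row["old_name"]: row["new_name"] for row in csv.DictReader(f)}

def rename_columns(sql_text: str, rename_map: dict) -> str:
    """Replace whole-word column references (names and aliases) in a SQL script."""
    for old, new in rename_map.items():
        sql_text = re.sub(rf"\b{re.escape(old)}\b", new, sql_text, flags=re.IGNORECASE)
    return sql_text

def rewrite_scripts(script_dir: str, mapping_csv: str) -> None:
    """Apply the rename map in place to every .sql script in a directory."""
    rename_map = load_rename_map(mapping_csv)
    for path in Path(script_dir).glob("*.sql"):
        path.write_text(rename_columns(path.read_text(), rename_map))
```

A purely textual approach like this one can miss names inside string literals or collide with keywords, which is why automated renames are typically paired with targeted manual review.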
Validation and consolidation
Next, we consolidated tables from the enterprise data warehouse track and incorporated them in MapR. For cell-level validation, we used LeapLogic's validation framework, which saved 75% of the manual effort. Where mismatches occurred, we identified the primary keys (extremely challenging, as Hive tables do not enforce primary keys) and addressed them to ensure a highly accurate conversion, with fewer than 10 defects across 800+ scripts.
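A minimal sketch of cell-level reconciliation is shown below, assuming CSV extracts of the same table from Hive and Snowflake. It infers a candidate key by testing column combinations for uniqueness (since Hive does not enforce primary keys) and then reports every mismatched cell; the file names and the pandas-based approach are illustrative assumptions, not LeapLogic's validation framework.

```python
from itertools import combinations
import pandas as pd

def infer_candidate_key(df: pd.DataFrame, max_cols: int = 3):
    """Find the smallest column combination that uniquely identifies every row."""
    for size in range(1, max_cols + 1):
        for cols in combinations(df.columns, size):
            if not df.duplicated(subset=list(cols)).any():
                return list(cols)
    return None

def cell_level_diff(source: pd.DataFrame, target: pd.DataFrame) -> pd.DataFrame:
    """Align source and target extracts on an inferred key and list mismatched cells."""
    key = infer_candidate_key(source)
    if key is None:
        raise ValueError("No candidate key found; fall back to full-row hashing.")
    merged = source.merge(target, on=key, suffixes=("_src", "_tgt"))
    mismatches = []
    for col in (c for c in source.columns if c not in key):
        diff = merged[merged[f"{col}_src"] != merged[f"{col}_tgt"]]
        for _, row in diff.iterrows():
            mismatches.append({**{k: row[k] for k in key}, "column": col,
                               "source_value": row[f"{col}_src"],
                               "target_value": row[f"{col}_tgt"]})
    return pd.DataFrame(mismatches)

# Hypothetical usage with table extracts from both platforms:
# report = cell_level_diff(pd.read_csv("hive_orders.csv"), pd.read_csv("snowflake_orders.csv"))
```

At the scale of 800+ scripts, such comparisons would normally run inside the platforms themselves (for example via row counts, checksums, and sampled cell diffs) rather than in memory, but the key-inference and cell-diff ideas are the same.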
Impact
The solution provided a significantly automated low-cost Hadoop migration pathway for all legacy dependencies. Teradata, DataStage, and Hadoop workloads were consolidated and migrated to Snowflake on Azure to provide a single source of truth.
The transformation enabled the retailer to retire their Hadoop infrastructure seamlessly and migrate to a highly scalable and performant environment on Snowflake. It also helped them consolidate foundational data and create a unified source of truth with easy access and advanced analytics capabilities.