25% cost savings and 40% faster time-to-market with automated SAS to Azure Databricks migration
Seamlessly transitioned 1.8 million lines of code and 526 Enterprise Guide Projects to Azure Databricks with over 70% automation, enhancing efficiency and analytics capabilities
Business needs
A leading American market research company sought to migrate its legacy SAS platform, which comprised of approximately 5,000 SAS files with 1.8 million lines of code and 526 Enterprise Guide Projects (EGPs) to an Azure Databricks solution.
The SAS platform supported critical business applications, including statistical and algorithmic modeling, descriptive and diagnostic analytics, data management and integration, marketing and customer analytics, and forecasting and strategic planning.
The migration aimed to:
- Reduce operational expenses by eliminating SAS license fees
- Match business SLAs of the Databricks-based solution with the legacy SAS environment
- Implement a fully cloud-native analytics stack
- Future-proof business by upgrading advanced analytics capabilities
- Achieve operational excellence while reusing the existing investments
Key requirements:
- Assess existing SAS scripts to identify relevant inventory, code complexity, optimization opportunities, etc., to develop a robust migration plan
- Build PySpark modules in Databricks to replace SAS programs and Enterprise Guide Projects
- Integrate job orchestration to replicate current SAS triggers, outputs, and process alerts
- Validate cell-to-cell data matches for all outputs and address exceptions where necessary
- Optimize performance to meet or exceed current SAS standards
- Re-point and redevelop ~1,500 reports, including Excel macros, pivot tables, and text reports
- Enhance migrated code by removing hardcoding and introducing parameterization
Solution
The migration team used Impetus LeapLogic, an automated cloud migration accelerator, to migrate SAS workloads to Azure Databricks. The transformation, which involved 70% auto-conversion of workloads, involved the following steps:
- Comprehensive assessment of all the SAS programs and EGP files to list the entire inventory, evaluate complexity distribution across files, and more
- Creation of an effective migration plan with time and cost estimates
- Integration of auto-converted SAS programs into the Databricks environment through CI/CD pipelines
- Orchestration using Azure Data Factory (ADF) to trigger pipelines for each business process
- Data retrieval from on-premises SAS server via linked services during pipeline execution
- Input data transfer from on-premises SAS server to predefined Blob storage directories, following SAS program paths
- Execution of auto-converted Databricks notebooks via ADF pipelines to generate the desired output
- Ad-hoc data movement on Databricks File System (DBFS) and Delta tables, with final outputs stored in Blob storage (TSV, XLS, XLSB, PDF, etc.)
- Migration of final output to the SAS server for validation until decommissioning
- Storage of application logs in the Log Analytics workspace to support job profiling and operations
- Automated validation of input/output availability and capture of ad-hoc SAS datasets expediting production readiness of converted programs
- Utilization of more than 200 reusable function libraries to accelerate development
- Development and implementation of a custom framework for data validation, supporting both single and batch validation across various data sources and formats (Excel files, CSV, Delta tables, etc.)
Solution Highlights
- Leveraged LeapLogic to automatically assess, automate, and accelerate the conversion of SAS programs into Databricks-compatible PySpark programs
- Performed a detailed automated assessment of a few million lines of code to evaluate workload complexity and plan the migration
- Identified the migration scope as 526+ SAS Enterprise Guide Projects (EGP) and ~5,000 SAS programs with heterogeneous data sources
- Approximately 75% of the files were assessed for high to very high complexity
- Used LeapLogic’s automation capabilities to rapidly convert code, validate data, and build production-ready pipelines
- Deployed a large library of ready-to-use PySpark functions covering common SAS patterns and complex statistical operations
- Converted 1,000+ macros into PySpark functions
- Facilitated static and dynamic data movement between on-premises and cloud storage for TSV, CSV, and XLSX formats through Azure Data Factory (ADF) and Azure DevOps (ADO) for applicable processes
- Automated orchestration generation using JSON-driven templates, creating an automation suite based on shell scripting/Python to generate JSON-based ADF pipeline templates. These templates were used for process execution, including input/output data movement and code execution.
- Implemented custom solutions for multiple SAS-specific machine learning procedures, such as PROC FASTCLUS, URC, DISCRIM (discriminant analysis), and STEPDISC (stepwise discriminant analysis)
- Developed Databricks Notebooks to orchestrate the entire workflow of interdependent SAS files
- Developed a Python-based validation framework for automated cell-level validation of final and intermediate datasets
Impact
The transition to an Azure Databricks-based platform enabled the market research company to significantly reduce costs, streamline workflows, shorten time-to-market, and enhance the efficiency and performance of their analytics infrastructure.
- 20–25% reduction in operational costs for large workflows
- 70% of SAS workloads auto-converted to Azure Databricks
- 30–40% reduction in time-to-market for critical long-running workflows and reports
- 10–15% of redundant scripts eliminated, removing inefficiencies and optimizing workflows
- 20% reduction in maintenance efforts through parameterizing workflows
