25% Cost Savings with SAS to Azure Databricks Migration

25% cost savings and 40% faster time-to-market with automated SAS to Azure Databricks migration

Seamlessly transitioned 1.8 million lines of code and 526 Enterprise Guide Projects to Azure Databricks with over 70% automation, enhancing efficiency and analytics capabilities

Business needs

A leading American market research company sought to migrate its legacy SAS platform, which comprised of approximately 5,000 SAS files with 1.8 million lines of code and 526 Enterprise Guide Projects (EGPs) to an Azure Databricks solution.

The SAS platform supported critical business applications, including statistical and algorithmic modeling, descriptive and diagnostic analytics, data management and integration, marketing and customer analytics, and forecasting and strategic planning.

The migration aimed to:

Reduce operational expenses by eliminating SAS license fees
Match business SLAs of the Databricks-based solution with the legacy SAS environment
Implement a fully cloud-native analytics stack
Future-proof business by upgrading advanced analytics capabilities
Achieve operational excellence while reusing the existing investments

Key requirements:

Assess existing SAS scripts to identify relevant inventory, code complexity, optimization opportunities, etc., to develop a robust migration plan
Build PySpark modules in Databricks to replace SAS programs and Enterprise Guide Projects
Integrate job orchestration to replicate current SAS triggers, outputs, and process alerts
Validate cell-to-cell data matches for all outputs and address exceptions where necessary
Optimize performance to meet or exceed current SAS standards
Re-point and redevelop ~1,500 reports, including Excel macros, pivot tables, and text reports
Enhance migrated code by removing hardcoding and introducing parameterization

Eliminated 15% redundant scripts, optimizing workflows

Solution

The migration team used Impetus LeapLogic, an automated cloud migration accelerator, to migrate SAS workloads to Azure Databricks. The transformation, which involved 70% auto-conversion of workloads, involved the following steps:

Comprehensive assessment of all the SAS programs and EGP files to list the entire inventory, evaluate complexity distribution across files, and more
Creation of an effective migration plan with time and cost estimates
Integration of auto-converted SAS programs into the Databricks environment through CI/CD pipelines
Orchestration using Azure Data Factory (ADF) to trigger pipelines for each business process
Data retrieval from on-premises SAS server via linked services during pipeline execution
Input data transfer from on-premises SAS server to predefined Blob storage directories, following SAS program paths
Execution of auto-converted Databricks notebooks via ADF pipelines to generate the desired output
Ad-hoc data movement on Databricks File System (DBFS) and Delta tables, with final outputs stored in Blob storage (TSV, XLS, XLSB, PDF, etc.)
Migration of final output to the SAS server for validation until decommissioning
Storage of application logs in the Log Analytics workspace to support job profiling and operations
Automated validation of input/output availability and capture of ad-hoc SAS datasets expediting production readiness of converted programs
Utilization of more than 200 reusable function libraries to accelerate development
Development and implementation of a custom framework for data validation, supporting both single and batch validation across various data sources and formats (Excel files, CSV, Delta tables, etc.)

Figure 1: Target solution architecture

Figure 2: Solution overview

Solution Highlights

Leveraged LeapLogic to automatically assess, automate, and accelerate the conversion of SAS programs into Databricks-compatible PySpark programs
Performed a detailed automated assessment of a few million lines of code to evaluate workload complexity and plan the migration
- Identified the migration scope as 526+ SAS Enterprise Guide Projects (EGP) and ~5,000 SAS programs with heterogeneous data sources
- Approximately 75% of the files were assessed for high to very high complexity
Used LeapLogic’s automation capabilities to rapidly convert code, validate data, and build production-ready pipelines
Deployed a large library of ready-to-use PySpark functions covering common SAS patterns and complex statistical operations
Converted 1,000+ macros into PySpark functions
Facilitated static and dynamic data movement between on-premises and cloud storage for TSV, CSV, and XLSX formats through Azure Data Factory (ADF) and Azure DevOps (ADO) for applicable processes
Automated orchestration generation using JSON-driven templates, creating an automation suite based on shell scripting/Python to generate JSON-based ADF pipeline templates. These templates were used for process execution, including input/output data movement and code execution.
Implemented custom solutions for multiple SAS-specific machine learning procedures, such as PROC FASTCLUS, URC, DISCRIM (discriminant analysis), and STEPDISC (stepwise discriminant analysis)
Developed Databricks Notebooks to orchestrate the entire workflow of interdependent SAS files
Developed a Python-based validation framework for automated cell-level validation of final and intermediate datasets

Automated migration of 5000 SAS scripts with 1.8 million lines of code to PySpark

Impact

The transition to an Azure Databricks-based platform enabled the market research company to significantly reduce costs, streamline workflows, shorten time-to-market, and enhance the efficiency and performance of their analytics infrastructure.

20–25% reduction in operational costs for large workflows
70% of SAS workloads auto-converted to Azure Databricks
30–40% reduction in time-to-market for critical long-running workflows and reports
10–15% of redundant scripts eliminated, removing inefficiencies and optimizing workflows
20% reduction in maintenance efforts through parameterizing workflows