04 Apr 2022

7 best practices to modernize data architecture on Databricks with LeapLogic


Abhijit Phatak, Director of Engineering, Impetus

Soham Bhatt, Modernization Practice Lead and Lead Solutions Architect, Databricks

A lakehouse is a new-age, open architecture that brings together the best components of data lakes and data warehouses, enabling enterprises to power a wide range of analytics use cases – from business intelligence (BI) to artificial intelligence (AI).

The Databricks Lakehouse Platform simplifies data architecture, breaks down data silos, and provides a single source of truth for analytics, data science, and machine learning (ML).

While data lakes excel at processing massive amounts of data at low cost, the Databricks Lakehouse addresses data reliability and ease-of-use constraints by adding data warehousing capabilities like:

  • ACID compliance,
  • Fine-grained security,
  • Update/delete/merge functionality,
  • Extensive ANSI SQL support, and
  • A super-fast BI querying engine.

The platform also natively provides direct access to data using open APIs and supports various ML and Python/R libraries like PyTorch, Tensorflow, and XGBoost.

LeapLogic, a cloud transformation accelerator, helps enterprises modernize legacy data warehouses, ETL, SAS, and Hadoop workloads to the Databricks Lakehouse. This blog discusses the best practices for ensuring a seamless transformation journey with LeapLogic.

Simplifying the modernization journey on Databricks with LeapLogic: 7 best practices

1. Strike a balance between ‘lift and shift’ approach and total refactoring

While strategizing the move to Databricks Lakehouse, many enterprises face the dilemma of adopting a ‘lift and shift’ approach or total refactoring.

A simple ‘lift and shift’ may sound intuitive and simple, but the existing technical debt can get carried over to the target. And there is no opportunity to re-architect and refactor the application for the cloud.

At the same time, total refactoring prolongs the time-to-market as it requires re-architecture, additional development time, and extensive training of resources on new processes and technologies.

LeapLogic strikes a fine balance between these two approaches.

It performs a comprehensive assessment to identify the technical debt and dependency structure across workloads, pre-empt performance optimizations, and provide target-specific recommendations. LeapLogic reverse-engineers logic from the code and data model and determines candidates for ‘as-is’ migration and refactoring.

2. Analyze workload usage patterns and dependencies to prioritize use cases

One of the toughest decisions in the modernization journey is finalizing the sequence in which workloads and applications should be migrated to the Databricks Lakehouse.

Performing an in-depth, pattern-based analysis of the existing workloads helps unlock data-driven insights and prioritize high-value business use cases. In addition, a comprehensive, end-to-end data and process lineage helps visualize complex interdependencies between all the workloads in a simplified manner.

Data lineage helps identify the complete chain of input and reference tables that are used to populate the output tables. Additionally, process lineage shows the linkages between scripts, procedures, jobs, and any other piece of code calling/using certain scripts, etc.

LeapLogic automates workload assessment and empowers enterprises with 360-degree visibility - from data collection, ingestion, and ETL to curation and consumption. This, in turn, helps create a robust blueprint for phase-wise migration to the Databricks Lakehouse and eliminates the risks associated with a ‘big bang’ approach.

3. Identify and resolve technical debt in existing systems

To ensure optimal price-performance ratio on Databricks, it is important to identify optimization opportunities within the existing legacy system at schema, code, and orchestration levels during the migration. In addition, the schema needs to be structured and modeled following Delta Lake’s best practices.

For instance, LeapLogic automatically handles the conversion of database indexes and specialized features like Teradata single-AMP joins or Azure Synapse columnar storage and distribution. It converts these to Databricks Delta-compliant DDL with code optimizations such as Delta Lake Z-Ordering, Bloom filter indexing, and partitioning/clustering.

Once the schema is optimized, any anti-patterns in the code (at the file or query level) must be identified and resolved, as these impact query performance adversely, particularly in larger datasets.

For orchestration, the logic in scheduling scripts or jobs may need to be broken down such that independent jobs or queries are executed in parallel. This helps significantly improve performance compared to sequential execution.

4. Plan the future-state architecture

A flexible, optimally designed data architecture helps enterprises efficiently address complex use cases, power advanced analytics, and meet business SLAs.

LeapLogic helps businesses design a resilient, high-performant Lakehouse architecture based on various considerations, including security, cost, and operational efficiencies.

Here’s a closer look at LeapLogic’s step-by-step approach:

Databricks Native Stack

Example of a high-level physical architecture on a Databricks-native stack

  • All on-premises systems (including data warehouses, ETL systems, analytics systems, Hadoop, and flat file sources) are transformed into a Databricks-native stack.
  • Data migration pipelines, Databricks pipelines, and notebooks are created automatically.
  • A dedicated, private connection is established to transfer data securely from the on-premises network leveraging Express Route/AWS Direct Connect/Cloud Interconnect.
  • The workload processing pipelines are then executed using Databricks jobs and workflows/Azure Data Factory/AWS Glue Studio/Cloud Dataflow and the data is ingested into cloud storage (Azure Data Lake Storage/ Amazon S3/Google Cloud Storage).
  • TThe transformed data warehouse and ETL code are automatically validated (through auto-generated reconciliation scripts) and executed via Databricks notebooks. The data is then ingested to Delta Lake.
  • The raw data is processed and made consumption-ready leveraging Databricks ETL workflows. Given the performance improvements of Databricks SQL, most BI reports can run directly on Databricks’ Gold/Presentation zone. This enables self-service analytics across the entire Lakehouse.
  • Cloud authentication and authorization mechanism is implemented to ensure secure access to resources and applications.
  • The data management platform, powered by Databricks Lakehouse, is also integrated with the consumption layer where the aggregated data is used for BI reporting.
  • The data management platform is integrated with Databricks MLflow and logs to cater to ML use cases and manage the end-to-end ML lifecycle.

5. Transform workloads end-to-end, including business logic

Data migration is an important first step of the data estate modernization journey. But to realize the full ROI on modernization, it is vital to migrate the complete source code with the business logic. The reason: the legacy system can be retired, and infrastructure and licensing costs can be recouped.

The complexity lies in migrating and modernizing code like stored procedures, ETL mappings, Shell scripts, SAS code (if any), and orchestration code (Control-M, AutoSys, etc.)

LeapLogic handles these transformations easily and supports Databricks innovations like Delta Live Tables, the new SQL Photon Engine, etc.

Here’s an example of how LeapLogic automatically transforms workloads, end-to-end.

Data warehouse and ETL:

  • Legacy DDL scripts are converted to Delta tables
  • SQL scripts and procedural logic are transformed to a Lakehouse-native form like Databricks SQL, PySpark, or Spark Scala
  • Scheduler/ workflow scripts like Control-M/AutoSys/EPS/Cron Shell are repointed or transformed automatically to Databricks jobs or cloud-native orchestration engines like Airflow or ADF
  • Shell scripts with embedded Teradata blocks are converted to Lakehouse-compatible Shell scripts + Spark SQL
  • Built-in functions of the data warehouse are transformed to Lakehouse-native or custom functions
  • ETL scripts are converted to Lakehouse-native code in PySpark or Scala + Spark SQL
  • Legacy BI/consumption reports are repointed to Databricks SQL
  • Schema and code are optimized with embedded Spark performance tuning parameters

SAS analytical scripts:

  • Legacy DDL scripts are converted to Delta tables
  • SQL scripts and procedural logic are transformed to Lakehouse-native code like PySpark
  • Business logic contained in procedural statements and macros are transformed to Lakehouse-native code like PySpark
  • Built-in SAS PROCs and analytical functions are converted to Lakehouse-native or custom functions using a rich library provided as part of the deliverables
  • SAS analytical workloads like data prep/acquisition (macros, data steps, conditional logic) and advanced PROCs/models are converted to Lakehouse-native code (PySpark or Spark Scala)

The transformed code is then validated and certified on the Databricks Lakehouse using a repeatable, verifiable, and extensible framework.

6. Cutover planning and decommissioning

Modernization initiatives aim at transitioning the legacy system capacity to new-age data architecture and instantly tapping the myriad benefits of Databricks. However, myopic planning can lead to a cutover disaster and completely derail project goals.

LeapLogic helps enterprises implement a robust, phased migration strategy and decommission the migrated code seamlessly after extensive testing and productionization in the new environment.

Decommissioning helps quickly limit cloud costs, ensure on-time project completion, and mitigate any risk of rolling back to a previous version.

7. Operationalize migrated workloads over Databricks Lakehouse-native services

The last leg of migration involves operationalizing the migrated workloads in the new environment using Lakehouse-native services. This enables a smooth transition into production and helps the enterprise go live on Databricks.

Operationalization includes:

  • Setting up DevOps and CI/CD processes,
  • Integrating with third-party tools/services,
  • Configuring proper cluster types and templates,
  • Enabling options for auto-scaling,
  • Ensuring auto-suspension and auto-resumption, and
  • Optimizing the cost-performance ratio.

Enterprises also need to set up security policies and protocols (such as SSL and AD integration), define regulatory compliance standards, and set up tools for lineage, metadata, etc.

LeapLogic takes advantage of implicit data governance and performs these steps automatically through infrastructure as code, while adhering to a specified project management methodology (like Agile).

To accelerate DevOps, it handles defects, ensures code packaging and check-ins into SCM, enables version control, and manages universal artifacts.

Final thoughts

The promise of the cloud is compelling IT and data teams to take a hard look at their current systems and evaluate what they need to make this promise a reality. However, the path to modernization can be challenging and disruptive.

Most companies are concerned about the decades of efforts and investments they have put in for writing and refining legacy system code. Also, with such legacy systems, there is often inadequate understanding of the complete logic, with only partial documentation available.

The need for custom implementation of user-defined functions and keywords only adds to the complexity. This is where migration accelerators like LeapLogic help customers modernize their legacy data warehouse, ETL, and analytics with ~70-90% automated code conversion.

LeapLogic has helped several Fortune 500 enterprises transform to Databricks faster, with more accuracy, and experience the freedom of the cloud without any business disruption.

You may also be interested in…


11 Mar 2022

Why is automated validation critical for end-to-end data platform modernization?

Data platform modernization is typically time-consuming and complex, requiring extensive expertise and resources. Besides migrating SQL and data, enterprises need to transform legacy workloads like queries, ETL, applications, reporting, and analytics to the chosen cloud environment or modern data platform.

The modernization journey does not end there. After migration, enterprises must ensure that all workloads perform optimally to meet the business SLAs in the target environment. Therefore, validating the converted code and business logic is important to resolve any issues or errors that might lead to business disruption.

This blog explores the importance of validation in the modernization journey, and how automating it can lead to better business and financial outcomes.

Why code validation is a Herculean task

Most migration projects go way beyond a simple ‘lift and shift.’ All legacy workloads cannot simply be moved as-is to the target environment – some need additional optimization, while others need complete re-engineering for efficient resource utilization on the cloud.

The migrated applications also need to support all use cases in the target environment, for which each use case must be validated on a live dataset. However, manual validation of the migrated code and applications on the target environment is extremely tedious.

Here’s a glimpse of the complexities at play:

  • Disparate code types – ETL workflows, orchestrator scripts, procedural logic, etc.
  • Complex business logic
  • Complex conditional logic in queries
  • Multiple enterprise scenarios and corner cases

Automated exhaustive quality and data-based checks of workloads across their lifecycle can help engineering teams drastically reduce manual effort and ensure a seamless transition into production.

Ensuring validation at the minutest level

Incorrectly loaded jobs, inaccurate logic, uncatered corner cases, or other errors can impact data points and insights. Enterprises need an end-to-end automated transformation solution that validates and certifies all workloads at the cell level to ensure data consistency and parity between source and target.

The automated validation process auto-generates reconciliation scripts by comparing data between the original and impacted tables, aggregate functions such as AVG, SUM, MIN, MAX, etc., and cell-to-cell comparison and reporting. This ensures that the converted codes are production-ready.

Automated validation of migrated workloads in LeapLogic
Automated validation of workloads using LeapLogic

Addressing security concerns

Testing the migrated data and transformed queries is integral to ensuring their optimal performance in the target environment. However, for compliance and regulatory reasons, many enterprises choose not to share their sensitive data for testing.

An automated transformation solution can address this challenge by generating a sample dataset with tens of unique records based on the exact query conditions and validating these with 100% accuracy. This is ideal for unit testing of the transformed queries.

In addition, such a solution can feed the customer-provided dataset for testing real datasets, which is more suitable for integration testing of transformed queries.

In a nutshell

Many data platform modernization projects fail because the migrated workloads cannot perform optimally in the target environment. While validation might seem like a minor part of the process, it is significant to ensure the end-to-end operationalization of the platform.

LeapLogic, Impetus’ cloud accelerator, simplifies and de-risks the entire process, enabling enterprises to smoothly transition into production and go live on the target with confidence. It automates the end-to-end transformation process – right from assessment to operationalization, powering modernization from any legacy system to any cloud-native stack or modern data platform.

To learn more, book a demo or start your free trial today.

You may also be interested in…


07 Mar 2022

Tackling the top 5 business risks of data landscape modernization

The cloud empowers enterprises with on-demand scalability, flexibility, and cost benefits, enabling them to respond to fast-changing business requirements and fuel growth.

According to Gartner1, more than 70% of companies have now migrated at least some workloads into the public cloud. While most IT and data teams are under immense pressure to migrate their data warehouse, ETL, and analytics quickly to the cloud, the path to modernization remains complex and risky. This blog outlines the top 5 associated business risks and how to address them.

#1 Business disruption

For most enterprises planning to move to the cloud, the fear of business disruption is overwhelming.

  • Will the end user experience be impacted?
  • Will our day-to-day operations continue to run smoothly?
  • Will there be any downtime or data loss?

These questions play heavily on the minds of infrastructure and operations leaders. A deeper dive reveals the underlying reasons behind these apprehensions – inadequate documentation, lack of in-house expertise, highly complex and technical systems, incomplete understanding of existing code logic, etc.

An automated assessment of legacy workloads can provide complete visibility into the existing codebase, including inventory profiling, identification of workload interdependencies and data/process lineage, key resource utilization metrics, and query complexity assessment.

These insights help enterprises choose the right transformation candidates, optimize their workloads for the target, plan the migration in phases, and ensure a seamless transition. Our assessment and transformation accelerator, LeapLogic, recently helped an American retail company convert Netezza and Informatica to an Azure-Databricks stack and operationalize the migrated workloads without any hitches in just 16 weeks.

#2 Spiralling costs

While strategizing the move to the cloud, most enterprises are concerned about their existing investments, not to mention the years of effort they have put into writing business logic, rules etc. To make the most of existing investments, it is important to reuse legacy workloads like data, analytics, and DML, ETL, orchestrator and reporting scripts, wherever possible.

Intelligent automated solutions can help reuse your existing investments by smartly converting diverse workloads and migrating the schema and data to the target platform.

Another common concern is technical debt. Design and code defects stacked in the legacy system over years can snowball into a death-spiral of technical debt, which causes multiple operational issues. Modularizing the architecture is one of the common techniques for avoiding technical debt.

For this, it is important to identify dependencies between the workloads at the process and data level, which is highlighted in lineage. Automated workload assessment and transformation with LeapLogic helped a telecom giant automate Teradata modernization and save millions of dollars.

#3 Inability to meet SLAs

Workloads transformed to the target-equivalent may not always perform well enough to meet business or technical SLAs. Often, even if they can be executed in the new environment, they fail to perform optimally on the target tech stack and architecture. This in turn impacts production SLAs, business decisions, costs, and the overall time-to-market.

To meet performance SLAs and control costs, focus on optimizing the price-performance ratio during migration. Start by identifying technical debt and anti-patterns in the source systems and code. These need to be resolved and optimized at the orchestration level to enable parallel execution as compared to sequential execution (which can lead to delays).

Advanced tools for automated transformation provide optimization recommendations for the target at the schema, code, and orchestration level. These recommendations are based on the enterprise’s workload types (ETL-heavy/consumption-heavy, etc.) and business goals.

A Fortune 500 global enterprise technology provider was able to improve SLAs by 20% through automated migration of legacy Teradata workloads to a modern data platform.

#4 Security loopholes

Transferring data to the cloud potentially carries many security risks, such as insider threats, accidental errors, external attacks, malware, misconfigured servers, insecure APIs, compliance breaches, etc.

A transformation partner can help implement best practices, define regulatory compliance standards, and set up effective policies and protocols for encryption/masking, authentication, and authorization. The migrated data and transformed queries also need to be tested and validated in the new environment.

However, for compliance and regulatory reasons, many enterprises choose not to share their sensitive data for testing. An automated migration accelerator can address this challenge by generating a sample dataset with tens of unique records based on the exact query conditions and validating these with 100% accuracy. In addition, the solution can feed the customer-provided dataset for testing on real datasets, which is suitable for integration testing of transformed queries.

#5 Project delays

For any modernization initiative, it is critical to clearly define the scope of migration and prioritize use cases with high business value. Segregating the workloads into units for as-is migration, optimization, and complete refactoring can help transform workloads in sprints and avoid schedule overruns. It is equally important to choose the right migration partner and toolset to ensure on-time completion.

Many organizations try to save on costs by giving the migration project to an internal team, even if they are not ready for it. This often leads to errors and rework, increasing costs, and delays in implementation.

To ensure successful cloud modernization, people at every level across the organization need to align on goals based on business value. All stakeholders must have a holistic view of the project scope and understand the responsibilities of the internal teams as well as the migration partner. Relevant in-house personnel should be aware of the industry’s latest tools, best practices, and reference implementations. Re-skilling and upskilling also need to be continuous processes.

Additionally, enterprises must ensure effective program and portfolio management by integrating IT governance with organizational governance.

Final thoughts

Large-scale data estate modernization projects can be extremely challenging because of the number of unpredictable variables involved. A manual approach can get the job done, but it requires substantial time commitments and comes with a high risk of human error.

Automated transformation accelerators like LeapLogic can help businesses transform both legacy logic and code, and journey to the cloud with the highest level of accuracy and minimal disruption.

To know more, book a demo today.


You may also be interested in…


12 Nov 2021

3 reasons why lineage should be an integral part of your modernization journey to Databricks

To implement trust-based governance, enterprises should acknowledge the different lineage and curation of assets. – Gartner

A comprehensive, end-to-end data and process lineage is quintessential for effectively planning the migration of legacy workloads to the Databricks Lakehouse. Lineage helps visualize complex interdependencies between all the workloads in a simplified and intelligible manner. For example, data lineage helps identify the complete chain of input and reference tables that are used to populate the output tables. Additionally, process lineage shows the linkages between scripts, procedures, jobs, and any other piece of code calling/using certain scripts, etc. This blog takes a closer look at why lineage plays a critical role in the modernization of ETL, data warehouse, and analytics systems to the Databricks Lakehouse.

#1 Enables end-to-end analysis of the existing codebase

Typically, even within one line of business, multiple data sources, entry points, ETL tools, and orchestration mechanisms exist. Decoding this complex data web and translating it into a simple visual flow can be extremely challenging during large-scale modernization programs. Here’s where a visual lineage graph adds tremendous value and helps define the roadmap to the Databricks Lakehouse. It deep dives into all the existing flows, like Autosys jobs, applications, ETL scripts, BTEQ/Shell (KSH) scripts, procedures, input and output tables and provides integrated insights. These insights help data teams take strategic decisions with greater accuracy and completeness. Enterprises can proactively leverage this integrated analysis to mitigate the risks associated with migration and avoid business disruption.

End-to-end lineage on LeapLogic, an accelerator for automated cloud transformation
End-to-end lineage on LeapLogic, an accelerator for automated cloud transformation

#2 Helps segregate workloads into logical units for migration

Advanced tools supporting an end-to-end modernization charter provide sophisticated assessment capabilities like interactive lineage (a graph) using a repeatable and verifiable DAG-based approach. This helps segregate interdependencies into logical units based on the level of business-criticality of applications/users. Enterprises can then scope their migration to the Databricks Lakehouse into logical phases and offload legacy workloads in a phased manner, rather than adopting a ‘big bang’ approach, which involves far greater risk. IT leaders can make informed decisions on which workloads can be migrated as-is, which ones need optimization, and which ones need complete re-engineering.

#3 Provides 360-degree visibility – from origination to reporting

Lineage also helps trace the complete data flow in steps. The visual representation helps track how the data hopped and transformed along the way, with details of parameter changes, etc. Automated transformation tools that provide backward and forward lineage show the hops and reference points of data lineage with complete audit trail of the data points of interest. They also help temporarily ignore peripheral, unwanted data points like system tables, bridge tables, etc. and establish useful correlations between data points to uncover hidden patterns and results. Users can also deep dive into certain flows leveraging next-gen features like contextual search, advanced filters according to workload type, an interactive visual interface, graph depth, and more. These capabilities help businesses plan a seamless migration and modernize their ETL, data warehouse, and analytics systems faster to the Databricks Lakehouse stack, with lower cost and lower risk.

You may also be interested in…