7 best practices to modernize data architecture on Databricks with LeapLogic


Authors:

Abhijit Phatak, Director of Engineering, Impetus

Soham Bhatt, Modernization Practice Lead and Lead Solutions Architect, Databricks


A lakehouse is a new-age, open architecture that brings together the best components of data lakes and data warehouses, enabling enterprises to power a wide range of analytics use cases – from business intelligence (BI) to artificial intelligence (AI). The Databricks Lakehouse Platform simplifies data architecture, breaks down data silos, and provides a single source of truth for analytics, data science, and machine learning (ML). While data lakes excel at processing massive amounts of data at low cost, the Databricks Lakehouse addresses data reliability and ease-of-use constraints by adding data warehousing capabilities like ACID compliance, fine-grained security, update/delete/merge functionality, extensive ANSI SQL support, and a super-fast BI querying engine. The platform also natively provides direct access to data using open APIs and supports various ML and Python/R libraries like PyTorch, Tensorflow, and XGBoost.

LeapLogic, a cloud transformation accelerator, helps enterprises modernize legacy data warehouses, ETL, SAS, and Hadoop workloads to the Databricks Lakehouse. This blog discusses the best practices for ensuring a seamless transformation journey with LeapLogic.

  1. Strike a balance between ‘lift and shift’ and total refactoring
  2. While strategizing the move to Databricks Lakehouse, many enterprises face the dilemma of adopting a ‘lift and shift’ approach or total refactoring. A simple ‘lift and shift’ may sound intuitive and simple, but the existing technical debt can get carried over to the target, and there is no opportunity to re-architect and refactor the application for the cloud. At the same time, total refactoring prolongs the time-to-market as it requires re-architecture, additional development time, and extensive training of resources on new processes and technologies. LeapLogic can help strike a fine balance between these two approaches. It performs a comprehensive assessment to identify the technical debt and dependency structure across workloads, pre-empt performance optimizations, and provide target-specific recommendations. LeapLogic reverse-engineers logic from the code and data model and determines candidates for ‘as-is’ migration and refactoring.

  3. Analyze workload usage patterns and dependencies to prioritize use cases
  4. One of the toughest decisions in the modernization journey is finalizing the sequence in which workloads and applications should be migrated to the Databricks Lakehouse. Performing an in-depth, pattern-based analysis of the existing workloads helps unlock data-driven insights and prioritize high-value business use cases. In addition, a comprehensive, end-to-end data and process lineage helps visualize complex interdependencies between all the workloads in a simplified manner. Data lineage helps identify the complete chain of input and reference tables that are used to populate the output tables. Additionally, process lineage shows the linkages between scripts, procedures, jobs, and any other piece of code calling/using certain scripts, etc. LeapLogic automates workload assessment and empowers enterprises with 360-degree visibility – right from data collection, ingestion, and ETL to curation and consumption. This, in turn, helps create a robust blueprint for phase-wise migration to the Databricks Lakehouse and eliminates the risks associated with a ‘big bang’ approach.

  5. Identify and resolve technical debt in existing systems
  6. To ensure optimal price-performance ratio on Databricks, it is important to identify optimization opportunities within the existing legacy system at schema, code, and orchestration levels during the migration. The schema needs to be structured and modeled following Delta Lake best practices. For instance, LeapLogic automatically handles conversion of database indexes and specialized features like Teradata single-AMP joins or Azure Synapse columnar storage and distribution. It converts these to Databricks Delta-compliant DDL with code optimizations such as Delta Lake Z-Ordering, Bloom filter indexing, and partitioning/clustering. Once the schema is optimized, any anti-patterns in the code (at the file or query level) must be identified and resolved, as these impact query performance adversely, particularly in larger datasets. For orchestration, the logic in scheduling scripts or jobs may need to be broken down such that independent jobs or queries are executed in parallel. This helps significantly improve performance compared to sequential execution.

  7. Plan the future-state architecture
  8. A flexible, optimally designed data architecture helps enterprises efficiently address complex use cases, power advanced analytics, and meet business SLAs. LeapLogic helps businesses design a resilient, high-performant Lakehouse architecture based on various considerations, including security, cost, and operational efficiencies. Here’s a closer look at LeapLogic’s step by step approach:

    • All on-premises systems (including data warehouses, ETL systems, analytics systems, Hadoop, and flat file sources) are transformed to a Databricks-native stack.
    • Data migration pipelines, Databricks pipelines, and notebooks are created automatically.
    • A dedicated, private connection is established to transfer data securely from the on-premises network leveraging Express Route/AWS Direct Connect/Cloud Interconnect.
    • The workload processing pipelines are then executed using Databricks jobs and workflows/Azure Data Factory/AWS Glue Studio/Cloud Dataflow and the data is ingested into cloud storage (Azure Data Lake Storage/ Amazon S3/Google Cloud Storage).
    • The transformed data warehouse and ETL code are automatically validated (through auto-generated reconciliation scripts) and executed via Databricks notebooks. The data is then ingested to Delta Lake.
    • The raw data is processed and made consumption-ready leveraging Databricks ETL workflows. Given the performance improvements of Databricks SQL, most BI reports can run directly on Databricks’ Gold/Presentation zone. This enables self-service analytics across the entire Lakehouse.
    • A cloud authentication and authorization mechanism is implemented to ensure secure access to resources and applications.
    • The data management platform, powered by Databricks Lakehouse, is also integrated with the consumption layer where the aggregated data is used for BI reporting.
    • The data management platform is integrated with Databricks MLflow and logs to cater to ML use cases and manage the end-to-end ML lifecycle.

    Databricks Native Stack

    Example of a high-level physical architecture on a Databricks-native stack

  9. Transform workloads end-to-end, including business logic
  10. Data migration is an important first step of the data estate modernization journey. But to realize the full ROI on modernization, it is vital to migrate the complete source code with the business logic, so that the legacy system can be retired, and infrastructure and licensing costs can be recouped. The complexity lies in migrating and modernizing code like stored procedures, ETL mappings, Shell scripts, SAS code (if any), and orchestration code (Control-M, AutoSys, etc.) LeapLogic handles these transformations easily and supports Databricks innovations like Delta Live Tables, the new SQL Photon Engine, etc. Here’s an example of how LeapLogic automatically transforms workloads, end-to-end.

    Data warehouse and ETL:

    • Legacy DDL scripts are converted to Delta tables
    • SQL scripts and procedural logic are transformed to a Lakehouse-native form like Databricks SQL, PySpark or Spark Scala
    • Scheduler/ workflow scripts like Control-M/AutoSys/EPS/Cron Shell are repointed; or transformed automatically to Databricks jobs or cloud-native orchestration engines like Airflow or ADF
    • Shell scripts with embedded Teradata blocks are converted to Lakehouse-compatible Shell scripts + Spark SQL
    • Built-in functions of the data warehouse are transformed to Lakehouse-native or custom functions
    • ETL scripts are converted to Lakehouse-native code in PySpark or Scala + Spark SQL
    • Legacy BI/consumption reports are repointed to Databricks SQL
    • Schema and code are optimized with embedded Spark performance tuning parameters

    SAS analytical scripts:

    • Legacy DDL scripts are converted to Delta tables
    • SQL scripts and procedural logic are transformed to Lakehouse-native code like PySpark
    • Business logic contained in procedural statements and macros is transformed to Lakehouse-native code like PySpark
    • Built-in SAS PROCs and analytical functions are converted to Lakehouse-native or custom functions using a rich library provided as part of the deliverables
    • SAS analytical workloads like data prep/acquisition (macros, data steps, conditional logic) and advanced PROCs/models are converted to Lakehouse-native code (PySpark or Spark Scala)

    The transformed code is then validated and certified on the Databricks Lakehouse using a repeatable, verifiable, and extensible framework.

  11. Cutover planning and decommissioning
  12. Modernization initiatives aim at transitioning the legacy system capacity to a new-age data architecture and instantly tapping the myriad benefits of Databricks. However, myopic planning can lead to a cutover disaster and completely derail project goals. LeapLogic helps enterprises implement a robust, phased migration strategy and decommission the migrated code seamlessly after extensive testing and productionization in the new environment. Decommissioning helps quickly limit cloud costs, ensure on-time project completion, and mitigate any risk of rolling back to a previous version.

  13. Operationalize migrated workloads over Databricks Lakehouse-native services
  14. The last leg of migration involves operationalizing the migrated workloads in the new environment using Lakehouse-native services. This enables a smooth transition into production and helps the enterprise go live on Databricks. Operationalization includes setting up DevOps and CI/CD processes, integrating with third-party tools/services, configuring proper cluster types and templates, enabling options for auto-scaling, ensuring auto-suspension and auto-resumption, and optimizing the cost-performance ratio. Enterprises also need to set up security policies and protocols (such as SSL and AD integration), define regulatory compliance standards, and set up tools for lineage, metadata, etc. LeapLogic takes advantage of implicit data governance and performs these steps automatically through infrastructure as code, while adhering to a specified project management methodology (like Agile). To accelerate DevOps, it handles defects, ensures code packaging and check-ins into SCM, enables version control, and manages universal artifacts.

    Final thoughts

    The promise of the cloud is compelling IT and data teams to take a hard look at their current systems and evaluate what they need to make this promise a reality. However, the path to modernization can be challenging and disruptive. Most companies are concerned about the decades of efforts and investments they have put in for writing and refining legacy system code. Also, with such legacy systems, there is often inadequate understanding of the complete logic, with only partial documentation available. The need for custom implementation of user-defined functions and keywords only adds to the complexity. This is where migration accelerators like LeapLogic help customers modernize their legacy data warehouse, ETL, and analytics with ~70-90% automated code conversion.

    LeapLogic has helped several Fortune 500 enterprises transform to Databricks faster, with more accuracy, and experience the freedom of the cloud without any business disruption.
Author
Gurvinder Arora
Lead Technical Writer