Ab Initio vs. AWS Glue

ETL (Extract, Transform, Load) allows enterprises to extract data from its original source, cleanse and prepare it, and load it into a target database for consumption by data and analytics teams. Over the years, ETL processes and tools have evolved significantly to meet the changing needs of enterprises. This blog post compares two of the most widely used ETL platforms – Ab Initio and AWS Glue.

Ab Initio is an on-premises tool used not only for ETL operations, but also for code storage, versioning, data quality, and metadata management. It offers several parsers, connectors, and components that can read and write files in multiple formats. Ab Initio has evolved into a powerful GUI-based, parallel-processing data integration platform and is used for many use cases and business applications – from operational and quality management systems to distributed application integration and complex event processing.

AWS Glue is a fully managed, serverless ETL service used to extract, transform, and load data into a target database. It works well with semi-structured data such as clickstream or process logs. Data engineers and ETL developers can visually create, execute, and monitor ETL workflows in AWS Glue Studio, while analysts can visually enrich, clean, and normalize data using AWS Glue DataBrew, which offers 250+ pre-built transformations.
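
To make the "semi-structured data" point concrete, here is a minimal sketch in plain Python (standard library only, not the AWS Glue API) of what an ETL step has to do with clickstream-style records: skip malformed lines, flatten nested fields, and align records that carry different attributes into one tabular shape. The sample events, field names, and the `extract`, `flatten`, and `load` helpers are all hypothetical and chosen purely for illustration.

```python
import json

# Hypothetical clickstream events; real logs would come from files or a stream.
RAW_EVENTS = [
    '{"user": "u1", "event": "click", "page": {"url": "/home", "ms": 120}}',
    '{"user": "u2", "event": "view"}',
    'not valid json',
    '{"user": "u1", "event": "click", "page": {"url": "/cart"}, "referrer": "ad"}',
]


def extract(lines):
    """Parse each line, skipping anything that is not a valid JSON object."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names, e.g. page.url."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def load(records):
    """Build rows over the union of all columns, filling gaps with None."""
    rows = [flatten(r) for r in records]
    columns = sorted({col for row in rows for col in row})
    return columns, [[row.get(col) for col in columns] for row in rows]


columns, table = load(extract(RAW_EVENTS))
```

A managed service handles this kind of schema alignment at scale; the sketch only shows the shape of the problem, where each record may have a different set of fields.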

Both Ab Initio and AWS Glue offer several powerful features for ETL processing. Here’s a head-to-head comparison.

| Category | Ab Initio | AWS Glue |
| --- | --- | --- |
| Inception | Started as a platform for ETL and enterprise application integration | Started as a unified data catalog and ETL service |
| Installation and maintenance | Has a full installation cycle involving high setup costs, maintenance overheads, and operational intervention | Offers a serverless architecture that does not need operational intervention. Sets up almost everything automatically (on a pay-per-use basis), including provisioning, scaling, and maintenance |
| Integration | No direct integration with AWS services, but can be extended using custom scripting | Provides full integration with other AWS services |
| Code generation | Generates code in a proprietary format, which can sometimes be difficult for developers to enhance or modify | Generates code that developers can easily enhance or modify |
| Scheduling | Does not have a native scheduler; relies on a third-party scheduler such as Autosys or Control-M | Provides built-in scheduling |
| Transformations | Enables a large number of transformations | Enables a large number of transformations |
| Output code and transformation format | Generates .KSH scripts and uses a proprietary XFR code format for transformations. This code is difficult to understand and to change manually | Generates Python Spark-based code, which is relatively easy to understand and modify |
| Performance | Provides job-level, component-level, and data-level parallelism. Once a few parent transformation records are available, subsequent transformations can start processing, which enables high performance | Uses Spark lazy evaluation concepts and processes multiple components in a single execution |
| Price-performance ratio | Job performance is good, but extending the current environment can be time-consuming | Can spin up multiple clusters to process jobs with a pay-per-use model (minimum session time is 30 min.) |
| Ease of use | Does not provide a notebook-like interface | AWS Glue Studio provides a notebook interface for users to create, run, and monitor ETL jobs |
| Debugging and monitoring | Provides debugging and monitoring using proprietary methods, which involve a learning curve | Enables integration with CloudWatch for monitoring all logs in one place. Users can profile the code and visualize metrics on Glue to debug an application |
| Language support | Supports writing Python and Shell scripts | Supports writing Python and Shell scripts |
| Pricing | Users pay for ETL processing, licensing, operational overheads, etc. | Users pay per use and are only charged for job execution |
| Security | Supports automatic identification of PII and sensitive data | Supports automatic identification of PII and sensitive data |
| Change Data Capture (CDC) support | Relies on the database to capture CDC requirements | Supports CDC via AWS Database Migration Service |
| Connectivity and support | Provides robust support for COBOL/EBCDIC and mainframe integration | Does not provide COBOL/EBCDIC integration support natively, but this can be handled through custom scripting |
| Storage and compute separation | Separates storage from compute capacity | Separates storage from compute capacity |
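
The Performance row describes two related ideas: downstream transformations starting as soon as a few upstream records exist, and lazy evaluation, where a chain of steps is only executed when a result is requested. The sketch below imitates both with plain Python generators. It is an analogy only and uses neither Spark nor Ab Initio; the stage names, the `trace` list, and the sample records are hypothetical.

```python
from itertools import islice

trace = []  # records which stage touched which record, in execution order


def read_source(n):
    """Source stage: emits one record at a time instead of a full dataset."""
    for i in range(n):
        trace.append(("read", i))
        yield {"id": i, "amount": i * 10}


def keep_large(records, threshold):
    """Filter stage: passes through records at or above the threshold."""
    for rec in records:
        trace.append(("filter", rec["id"]))
        if rec["amount"] >= threshold:
            yield rec


def add_label(records):
    """Transform stage: derives a new field from each record."""
    for rec in records:
        trace.append(("label", rec["id"]))
        yield {**rec, "label": f"order-{rec['id']}"}


# Building the pipeline does no work yet (lazy evaluation).
pipeline = add_label(keep_large(read_source(1000), threshold=20))
steps_before_consume = len(trace)

# Asking for two results pulls only as many source records as needed,
# and each record flows through all stages before the next one is read.
first_two = list(islice(pipeline, 2))
records_read = sum(1 for stage, _ in trace if stage == "read")
```

In this run, only four of the 1,000 source records are read to produce two results, and no stage waits for the previous one to finish the whole dataset. Real engines add partitioning and distributed scheduling on top of this basic idea.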

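As an illustration of the "custom scripting" mentioned for COBOL/EBCDIC data, the following sketch decodes a fixed-width EBCDIC record using Python's built-in `cp037` codec. The record layout, field names, and sample data are hypothetical, and `cp037` is only one of several EBCDIC code pages a mainframe might use. Packed-decimal (COMP-3) and binary fields would need additional handling that is not shown here.

```python
# Hypothetical fixed-width layout: (field name, start offset, end offset).
RECORD_LAYOUT = [
    ("customer_id", 0, 5),
    ("name", 5, 15),
    ("state", 15, 17),
]


def decode_record(raw, layout=RECORD_LAYOUT, encoding="cp037"):
    """Decode one EBCDIC text record and slice it into named fields.

    cp037 is a single-byte code page, so character offsets in the
    decoded string match byte offsets in the raw record.
    """
    text = raw.decode(encoding)
    return {name: text[start:end].strip() for name, start, end in layout}


# Simulate a mainframe record by encoding sample text as EBCDIC bytes.
sample = "00042JANE DOE  NY".encode("cp037")
record = decode_record(sample)
```

The same function could be applied per record inside a larger job, with the layout typically derived from the COBOL copybook that describes the file.
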
In addition to the capabilities highlighted above, here’s our take on some of the important pros and cons of Ab Initio and AWS Glue:

Which tool is best overall? That's something every organization must decide based on its unique data architecture and analytics needs. Factors to consider include elasticity, cost, collaboration efficiency, operational excellence, and reliability. What’s more, the tool needs to integrate seamlessly with the other infrastructure and services running in the overall data warehouse environment. Based on our experience with large-scale data engineering and cloud transformation projects, we believe AWS Glue provides a significant competitive edge over traditional ETL tools. To learn how you can seamlessly modernize your ETL workloads from Ab Initio to AWS Glue, watch this quick demo.

Author
Abhijit Phatak
Director of Engineering