Ab Initio vs AWS Glue: Understanding the Difference

ETL (Extract, Transform, Load) lets enterprises extract data from source systems, cleanse and prepare it, and load it into a target database for consumption by data and analytics teams.

Over the years, ETL processes and tools have evolved significantly to cater to the changing needs of enterprises. This blog post compares two of the most widely used ETL platforms, Ab Initio and AWS Glue.

Understanding The Basics

Ab Initio

Ab Initio is an on-premises tool used for:

ETL operations
Code storage
Versioning
Data quality
Metadata management

It offers several parsers, connectors, and components that can read and write files in multiple formats. Ab Initio has evolved into a powerful GUI-based, parallel-processing data integration platform and is used for many use cases and business applications – from operational and quality management systems to distributed application integration and complex event processing.

AWS Glue

AWS Glue is a fully managed ETL service used to extract, transform, and load data into a target database. As a serverless data integration service, it works well with semi-structured data such as clickstream data or process logs.
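To make "semi-structured" concrete, here is a minimal sketch in plain Python (not the AWS Glue API) of the kind of flattening an ETL job typically applies to nested clickstream records before loading them into a tabular target. The sample events, the `flatten` helper, and the dotted column names are all hypothetical and chosen only for illustration.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical clickstream events; note the second one is missing fields.
raw_lines = [
    '{"user": {"id": 1, "country": "US"}, "event": "click", "page": "/home"}',
    '{"user": {"id": 2}, "event": "view"}',
]

# Extract and transform: parse each line and flatten it.
rows = [flatten(json.loads(line)) for line in raw_lines]

# Build a uniform column set so every row fits the same target table;
# fields absent from a record become None.
columns = sorted({key for row in rows for key in row})
table = [{col: row.get(col) for col in columns} for row in rows]
```

The point of the sketch is that records in the same feed can have different shapes, so the job has to infer a common schema rather than assume one up front.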

Additionally, data engineers and ETL developers can visually create, execute, and monitor ETL workflows in AWS Glue Studio. Analysts can visually enrich, clean, and normalize data using AWS Glue DataBrew, which offers 250+ pre-built transformations.

Both Ab Initio and AWS Glue offer several powerful features for ETL processing. Here’s a head-to-head comparison.

Category	Ab Initio	AWS Glue
Inception	Started as a platform for ETL and enterprise application integration	Started as a unified data catalog and ETL service
Installation and maintenance	Has a full installation cycle involving high setup costs, maintenance overheads, and operational intervention	Offers a serverless architecture that does not need operational intervention. Handles provisioning, scaling, and maintenance almost entirely automatically, on a pay-per-use basis
Integration	No direct integration with AWS services, but can be extended using custom scripting	Provides full integration with other AWS services
Code generation	Generates code in a proprietary format which can sometimes be difficult for developers to further enhance/modify	Generates code which can be easily consumed by developers for further enhancement/modification
Scheduling	Does not have a native scheduler and relies on a third-party scheduler such as Autosys or Control-M	Provides in-built scheduling
Transformations	Enables a large number of transformations	Enables a large number of transformations
Output code and transformation format	Generates .KSH scripts and uses a proprietary XFR code format for transformations. This code is difficult to understand and to change manually	Generates Python Spark-based code which is relatively easy to understand and modify
Performance	Provides job level, component level, and data level parallelism. Once a few parent transformation records are available, subsequent transformations can start processing, which enables high performance	Uses Spark lazy evaluation concepts and processes multiple components in a single execution
Price-performance ratio	Job performance is good, but extending the current environment can be time-consuming	Can spin up multiple clusters to process jobs with a pay-per-use model (minimum session time is 30 min.)
Ease of use	Does not provide a notebook-like interface	AWS Glue Studio provides a notebook interface for users to create, run, and monitor ETL jobs
Debugging and monitoring	Provides debugging and monitoring using proprietary methods, which involve a learning curve	Enables integration with CloudWatch for monitoring all logs in one place. Users can profile the code and visualize metrics on Glue to debug an application.
Language support	Supports writing Python and Shell scripts	Supports writing Python and Shell scripts
Pricing	Users pay for ETL processing, licensing, operational overheads, etc.	Users pay per use and are only charged for job execution
Security	Supports automatic identification of PII and sensitive data	Supports automatic identification of PII and sensitive data
Change Data Capture (CDC) support	Relies on the database to capture CDC requirements	Supports CDC via AWS Database Migration Service
Connectivity and support	Provides robust support for Cobol/EBCDIC and Mainframe integration	Does not provide Cobol/EBCDIC integration support natively, but this can be handled through custom scripting
Storage and compute separation	Separates storage from compute capacity	Separates storage from compute capacity
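
As an illustration of the "custom scripting" mentioned in the Cobol/EBCDIC row above, here is a minimal sketch of decoding a fixed-width EBCDIC record using only Python's standard library. The record layout, the field names, and the choice of code page `cp037` are assumptions made for the example. Real mainframe files often use other code pages and binary field types such as packed decimals, which this sketch does not handle.

```python
def decode_ebcdic_record(raw, layout, encoding="cp037"):
    """Split a fixed-width EBCDIC record into named text fields.

    `layout` is a list of (field_name, width_in_bytes) pairs.
    """
    fields = {}
    offset = 0
    for name, width in layout:
        chunk = raw[offset:offset + width]
        # Decode EBCDIC bytes to str and drop the trailing space padding.
        fields[name] = chunk.decode(encoding).rstrip()
        offset += width
    return fields


# Hypothetical layout: a 6-byte customer id followed by a 10-byte name.
LAYOUT = [("customer_id", 6), ("name", 10)]

# Build a sample record by encoding text to EBCDIC, standing in for
# bytes read from a mainframe extract.
sample = "000042".encode("cp037") + "ALICE".ljust(10).encode("cp037")

record = decode_ebcdic_record(sample, LAYOUT)
```

In a Glue job, a function along these lines would run inside the job script on each raw record before the data is handed to the usual transformations.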

Ab Initio vs AWS Glue: Pros

In addition to the capabilities highlighted above, here’s our take on some of the important pros of Ab Initio and AWS Glue:

Final Thoughts

Which tool is best overall? That’s something every organization must decide based on its unique data architecture and analytics needs.

There are several factors to consider, including:

Elasticity
Cost
Collaboration efficiency
Operational excellence
Reliability, etc.

What’s more, the tool needs to seamlessly integrate with other infrastructure/services running in the overall data warehouse environment.

Based on our experience with large-scale data engineering and cloud transformation projects, we believe AWS Glue provides a significant competitive edge over traditional ETL tools. To learn how you can seamlessly modernize your ETL workloads from Ab Initio to AWS Glue, watch this quick demo.

Author
Abhijit Phatak
Director of Engineering