The solution for Culqi Data Lake on AWS
Integrated technologies:
Amazon S3, AWS EMR, AWS SageMaker Studio, AWS Directory Service, AWS Glue, AWS DMS, Amazon EKS, Amazon EC2, ALB (Application Load Balancer), Amazon Aurora, Amazon ECR, VPN, AWS Security, IAM, CloudWatch, EventBridge, CloudTrail, AWS Secrets Manager.
Culqi Data Lake on AWS
The migration of Culqi’s data warehouse to a Data Lake on AWS, managed by BigCheese, has transformed its data analytics capabilities, providing a scalable, secure and flexible solution to meet the challenges of an ever-evolving data environment.
The challenge
Culqi, a leading electronic payment solutions company, needed to modernize and scale its data infrastructure. Its legacy data warehouse could not keep up with the growing demands of handling structured and unstructured data, running real-time analytics, and accessing diverse sources of information efficiently.
The main challenge was to migrate the data warehouse to a more scalable and flexible solution that could handle large volumes of structured and unstructured data and integrate with multiple data sources, all without affecting performance or business continuity.
Technical solution
BigCheese partnered with Culqi to design and implement a Data Lake on AWS, with the following key features and services:
- Design of a Decoupled Data Lake:
- A decoupled Data Lake was implemented to allow component independence, which facilitates the individual scalability of each part of the system. This ensures that Culqi can adjust and scale resources without depending on the rest of the infrastructure.
- The design supports both structured data (relational databases) and unstructured data (files, logs, etc.).
- Storage in Amazon S3:
- Data is stored in Amazon S3, organized in three or more layers to optimize information handling and processing:
- Ingest Layer: Receives near-real-time data from multiple sources.
- Processing Layer: Pipelines transform the ingested data.
- Presentation Layer: Holds the processed data, ready for consumption.
- Scalable Processing with AWS EMR:
- Data processing is performed on an AWS EMR cluster, which can scale to consume compute resources on demand. This flexibility is critical for responding to fluctuating loads and processing peaks in big data analytics.
- Query Engine: Trino:
- For data querying, we opted for Trino, a distributed SQL query engine that runs on top of the EMR cluster and allows Culqi to access and cross-reference information from multiple data sources (such as Google Sheets, MySQL and others). This makes it possible to integrate different data sources efficiently for advanced analysis.
- Pipeline Development and Data Science with AWS SageMaker:
- Culqi’s data science teams use AWS SageMaker Studio connected to the same EMR cluster, allowing them to develop and manage machine learning pipelines and advanced analytics in an agile way, without the need for additional infrastructure.
- Data Exploration with Metabase:
- For data visualization and exploration, Culqi uses Metabase, which runs on AWS, providing business teams with a fast and accessible analysis tool.
- Security and Governance with AWS Directory Service:
- Security and access control are managed through AWS Directory Service, the managed Active Directory service on AWS, which supports data governance and compliance with security regulations.
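As an illustration of the three-layer storage design described above, the following is a minimal sketch of how dataset locations could be derived per layer. The layer prefixes, the bucket name, and the date-partitioned path scheme are assumptions made for this example; the case study names the layers but does not describe their actual paths.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical prefixes for the Ingest, Processing and Presentation layers.
LAYERS = ("ingest", "processing", "presentation")


@dataclass(frozen=True)
class DatasetLocation:
    bucket: str  # hypothetical bucket name
    source: str  # originating system, e.g. "mysql"
    table: str   # dataset or table name

    def uri(self, layer: str, day: date) -> str:
        """Build the S3 prefix for this dataset in a given layer and day."""
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer!r}")
        return (
            f"s3://{self.bucket}/{layer}/{self.source}/{self.table}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        )


def next_layer(layer: str) -> Optional[str]:
    """Return the layer that follows `layer`, or None for the last one."""
    index = LAYERS.index(layer)
    return LAYERS[index + 1] if index + 1 < len(LAYERS) else None


if __name__ == "__main__":
    location = DatasetLocation("example-data-lake", "mysql", "payments")
    print(location.uri("ingest", date(2024, 1, 5)))
    print(next_layer("ingest"))
```

Keeping each layer under its own prefix means a pipeline only needs to know which layer it reads from and which it writes to, which fits the decoupled design described earlier.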
The results
Thanks to the solution implemented by BigCheese, Culqi has benefited from:
94%
reduction in data ingestion time
We achieved a 94% reduction in ingestion time. Loading the data used to take 5 hours; now all the updated data is available in just 18 minutes.
89%
improvement in query times
We improved query time by 89%. Before, each query meant a wait of minutes, which made analysis very tedious. Now query times are down to seconds.
190%
more users accessing the data, from 100 to 290 active users.
Culqi is now actively consuming its data, and consuming it 190% more than before.
Real time
Greater integration of data sources, allowing near real-time connection.
One of the big differences between a Data Lake and a Data Warehouse is the number of databases it can integrate with. In this case, we not only integrated more data sources but also brought query times down to near real time.
100%
improvement in the visibility of data consumption.
It is very frustrating to build something that people don’t use. In this case, we have a 100% improvement in the visibility of data consumption, and this Data Lake receives more than 20,000 queries a day. People are using it, adopting it, and becoming data-driven.
67%
reduction in Time-to-Market, accelerating the implementation of new strategic analysis and reporting.
Products now reach the market 67% faster than before.
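The ingestion and user figures above can be checked from the quoted before and after values. The short sketch below does that arithmetic; the function names are illustrative only and are not part of the solution.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) * 100 / before


def percent_increase(before: float, after: float) -> float:
    """Percentage by which `after` is higher than `before`."""
    return (after - before) * 100 / before


# Ingestion time: 5 hours (300 minutes) down to 18 minutes.
ingestion_reduction = percent_reduction(5 * 60, 18)

# Active users: 100 up to 290.
user_increase = percent_increase(100, 290)

if __name__ == "__main__":
    print(f"Ingestion time reduction: {ingestion_reduction:.0f}%")
    print(f"Active user increase: {user_increase:.0f}%")
```

Going from 5 hours to 18 minutes works out to a 94% reduction, and going from 100 to 290 active users is a 190% increase.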