Nike’s Principal Data Engineer Ashok Singamaneni joins Benjamin and Eldad to discuss his open-source data quality framework, Spark Expectations. Ashok explains how the tool, inspired by Databricks DLT Expectations, shifts data quality checks to before the data is written to a final table. This proactive approach uses row-level, aggregation-level, and query-level data quality checks to fail jobs, drop bad records, or alert teams, ultimately saving huge costs on recompute and engineering effort in mission-critical data pipelines.
In this episode of The Data Engineering Show, Benjamin and Eldad are joined by Ashok Singamaneni, a Principal Data Engineer at Nike. Ashok dives deep into his work on the open-source projects BrickFlow and Spark Expectations. He shares his journey from mechanical engineering to data engineering and the lessons learned over a decade of tackling production data quality issues that lead to costly recomputes.
Ashok explains the philosophy behind Spark Expectations: treating the ingestion and transformation layers of a data pipeline (Bronze/Silver) as a software product rather than just a data engineering product. This means implementing rigorous checks like data quality, unit testing, and integration testing before the data is written to the final layer. He details the implementation using a Python decorator pattern within Spark jobs, allowing engineers to define rules that check for everything from basic column validation to complex referential integrity and aggregation consistency. The discussion also covers the trade-offs of using generative AI tools like Cursor for data engineering and the growing industry trend of prioritizing upfront data quality due to the rise of AI-powered analytics and direct leadership access to data.
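Since Ashok describes the implementation as a Python decorator wrapped around the function that produces a job's output DataFrame, here is a minimal, hypothetical sketch of that pattern. The decorator name, rule format, and table names below are invented for illustration; they are not the Spark Expectations API itself, which (per the project docs) manages rules in a configured rules table and records run statistics for you.

```python
# Hypothetical sketch of the decorator pattern discussed in the episode -- not the
# actual Spark Expectations API. The idea: rules run on the DataFrame a job
# produces, and bad rows are dropped (or the job fails) before the final write.
from functools import wraps
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def with_quality_checks(rules, error_table, fail_on_error=False):
    """Run row-level rules before the decorated job's output reaches the final table."""
    passing = " AND ".join(f"({r})" for r in rules)

    def decorator(job):
        @wraps(job)
        def wrapper(*args, **kwargs):
            df: DataFrame = job(*args, **kwargs)
            failed = df.filter(~F.expr(passing))
            if failed.take(1):  # at least one rule violation
                if fail_on_error:
                    # Mission critical: stop the job so nothing bad lands in the final table.
                    raise ValueError("Data quality checks failed; aborting before the final write")
                # Otherwise drop the bad rows into an error table for the team to inspect.
                failed.write.mode("append").saveAsTable(error_table)
            return df.filter(F.expr(passing))
        return wrapper
    return decorator

@with_quality_checks(
    rules=["order_id IS NOT NULL", "amount >= 0"],  # row-level expectations
    error_table="bronze.orders_dq_errors",
    fail_on_error=False,                            # drop and alert instead of failing the job
)
def build_orders() -> DataFrame:
    return spark.read.table("raw.orders")
```

The point of the pattern is that the checks sit in front of the write: whatever `build_orders` returns has already been filtered, or the job has already failed, so the final table never needs a recompute because of bad records.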
What You'll Learn:
- Why the ingestion and transformation layers (Bronze/Silver) of a data pipeline should be treated as a software product with rigorous testing.
- How Spark Expectations moves data quality checks to before data is written to the final tables to prevent mission-critical failures and recomputes.
- The three types of checks in Spark Expectations: row-level, aggregation-level, and query DQ (for referential integrity), illustrated in the sketch after this list.
- How the tool handles failures with options to ignore, drop the record, or fail the entire job.
- Why data quality is becoming a prime focus across the industry due to AI integrations and direct executive-level access to data.
- Ashok’s lessons on using generative AI tools (like Cursor and Claude Code) in data engineering projects and the necessity of restrictive permissions.
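To make the bullets on check types and failure handling concrete, here is a hedged sketch of rule definitions covering all three levels and all three actions. The field names and values are illustrative approximations only; the exact rules-table schema lives in the Spark Expectations documentation.

```python
# Illustrative rule definitions covering the three check levels and the three
# failure actions mentioned in the episode. Field names approximate a rules-table
# schema; consult the Spark Expectations docs for the real columns and values.
example_rules = [
    {   # Row-level check: validate every record before it is written.
        "rule_type": "row_dq",
        "rule": "order_amount_is_positive",
        "expectation": "amount >= 0",
        "action_if_failed": "drop",    # route bad rows to an error table and alert the team
    },
    {   # Aggregation-level check: validate totals or counts over the whole batch.
        "rule_type": "agg_dq",
        "rule": "row_count_is_reasonable",
        "expectation": "count(*) > 1000",
        "action_if_failed": "ignore",  # record the violation but keep processing
    },
    {   # Query-level check, e.g. referential integrity against another table.
        "rule_type": "query_dq",
        "rule": "orders_reference_known_customers",
        "expectation": (
            "(select count(*) from orders o "
            "left anti join customers c on o.customer_id = c.customer_id) = 0"
        ),
        "action_if_failed": "fail",    # mission critical: stop the job, write nothing
    },
]
```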
About the Guest
Ashok Singamaneni is a Principal Data Engineer at Nike, with over twelve years of experience in the data space across the banking, healthcare, and retail domains. He is the creator of the popular open-source frameworks Spark Expectations and BrickFlow, which focus on improving data quality and pipeline reliability. Ashok advocates for treating data ingestion and transformation as a software product, ensuring checks and balances are in place early in the pipeline. He holds a background in mechanical engineering.
Quotes
"DLT expectations gave an idea to the industry that you can do data quality before actually writing the data into your final tables." - Ashok
"I think over the time, in my experience, what I learned is this ingestion layer and the transformation layer, you should treat that as a software product, not like a data engineering product." - Ashok
"If it's mission critical, then you fail the job, not process the data, and don't put that data into the final table so that you don't need to recompute that again." - Ashok
"As the scale of the product increases, it becomes even more difficult for us to find exactly where the issue went wrong... it takes time for you to debug and see, like, lot of human effort also involved." - Ashok
"Data observability and quality is becoming prime because of AI integrations that are happening." - Ashok
"Ultimately, at the end of the day, you are responsible when you're checking in the code. It's not Claude or Karsar that will be blamed if something goes wrong." - Ashok
"The leadership is directly looking at the data and if there is something wrong in the data, then there can be some serious repercussions happening on the business decisions." - Ashok
"Rather than having bad data in the tables and then recomputing or reclarifying things, let's not put that data first in the first place." - Ashok
"You can drop the record and put that in an error table and give that alert to the engineering team that there is some error in the error table you can look at." - Ashok
"The road eq checks that happens are very fast. It should happen as a pretty standard checks that happens on the scale." - Ashok
Resources
Projects:
- Spark Expectations - Data quality framework
- BrickFlow - Open source project for data pipelines
Tools & Technologies:
- Apache Spark
- Databricks DLT (Delta Live Tables)
- Great Expectations - Post-processing data quality tool
- Cursor / Claude Code - Generative AI coding tools
- SQLMesh
For Feedback & Discussions on Firebolt Core:
Primary Speakers: Benjamin, Eldad, and Ashok Singamaneni