The $100M Problem: How Lyft's Data Platform Prevents ML Failures, with Ritesh Varyani
What if your data platform could serve AI-native workloads while scaling reliably across your entire organization? In this episode, Benjamin sits down with Ritesh Varyani, Staff Software Engineer at Lyft, to explore how to build a unified data stack with Spark, Trino, and ClickHouse, why AI is reshaping infrastructure decisions, and the strategies powering one of the industry's most sophisticated data platforms. Whether you're architecting data systems at scale or integrating AI into your analytics workflow, this conversation delivers actionable insights into reliability, modernization, and the future of data engineering. Tune in to discover how Lyft is balancing open-source investments with cutting-edge AI capabilities to unlock better insights from data.
In this episode of the Data Engineering Show, host Benjamin Wagner sits down with Ritesh Varyani, Staff Software Engineer at Lyft, to explore how the company manages a sophisticated multi-engine data stack serving thousands of engineers while simultaneously integrating AI across infrastructure and user-facing analytics.
What You'll Learn:
- How to architect a polyglot data platform that serves fundamentally different workloads: Spark for ML training and massive parallel processing, Trino for dashboarding and medium-scale ETL, and ClickHouse for sub-second OLAP queries, all without creating operational chaos (see the sketch after this list)
- Why unification matters more than expansion: Lyft's 2026 strategy prioritizes consolidating and simplifying the data stack rather than adding new tools, reducing maintenance burden and improving reliability for end users
- The dual-layer AI strategy that simultaneously enhances user analytics (semantic layer v2 with AI-native support) while automating platform operations (intelligent job failure diagnosis, adaptive resource allocation, and agentic workflow optimization)
- How to fund innovation from the bottom up: Lyft's model encourages individual engineers to experiment with AI on their own time, prove business value through POCs, and secure leadership buy-in through demonstrated alignment with company strategy
- Why vendor selection now includes AI explainability and debuggability as standard RFP requirements, even when AI isn't the primary driver of a purchasing decision
- The framework for deciding open-source investment vs. managed services: prioritize business-critical goals first, then determine whether in-house ownership or vendor solutions accelerate that mission; AI becomes the accelerant, not the decision driver
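As a rough illustration of the engine split described in the first point above, here is a minimal, hypothetical Python sketch of workload-based routing. The workload names and the mapping function are assumptions drawn from the episode summary, not Lyft's actual code.

```python
# Hypothetical sketch of workload-based engine routing, not Lyft's implementation.
from enum import Enum, auto


class Workload(Enum):
    ML_TRAINING = auto()     # large-scale feature/label generation
    BATCH_ETL = auto()       # heavy scheduled transformations
    DASHBOARD_SQL = auto()   # BI dashboards and ad hoc analyst queries
    REALTIME_OLAP = auto()   # sub-second aggregations for product surfaces


def pick_engine(workload: Workload) -> str:
    """Map a workload class to the engine described in the episode."""
    if workload in (Workload.ML_TRAINING, Workload.BATCH_ETL):
        return "spark"       # massive parallel processing, ML training
    if workload is Workload.DASHBOARD_SQL:
        return "trino"       # dashboarding and medium-scale ETL
    return "clickhouse"      # sub-second OLAP latency


if __name__ == "__main__":
    for w in Workload:
        print(f"{w.name:>14} -> {pick_engine(w)}")
```

The point of a router like this, however it is actually implemented, is that users express a workload rather than an engine, which is what keeps a polyglot stack from turning into operational chaos.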
About the Guest
Ritesh is a Staff Software Engineer at Lyft, bringing six years of experience architecting and scaling the company's data platform. With a background spanning Microsoft's data and cloud infrastructure, including work on Hadoop, Azure, and SaaS products, Ritesh leads Lyft's critical data systems, including Trino, Spark, and ClickHouse. In this episode, Ritesh shares insights on building scalable, AI-native data platforms that serve diverse organizational needs, from batch processing and analytics to real-time marketplace operations. His strategic approach to unifying a complex data stack while integrating AI-driven reliability and user experience improvements provides actionable guidance for data engineers and platform leaders navigating infrastructure modernization at scale.
Quotes
"The goal of our platform is to give our users access to the data as fast as possible so that they can drive the meaning from the data that they are getting and take better data driven decisions." - Ritesh
"We are a Hive format shop. We are going to be moving to other open table formats in the future, but at this point, we are a hive table format." - Ritesh
"Our main goal at this point is primarily understanding how we see the data platform running five years from now, three years from now, and how we are able to future proof it." - Ritesh
"In this world of AI, we should not be falling behind in any way, and bringing AI in the right places within our platform." - Ritesh
"We want to make our semantic layer ready for the AI native side of things so that our teams are able to drive the best meaning possible from the data that they see." - Ritesh
"Big data systems are distributed systems by nature, and where AI can help you is very clearly understand how the patterns are changing and what is a good action to take." - Ritesh
"Rather than thinking of this as an AI versus an open source thing, it's about a question of what work is the most business critical and how do you go 100% behind it." - Ritesh
"Not everybody is working on AI initiatives at this point, but where it makes sense according to our business strategy, if it aligns with it, then obviously we go and invest." - Ritesh
"If you are the one who's going to take on the initiative, probably spend a few hours outside of what you're already working on, and that is how you will discover AI and the tooling for it." - Ritesh
"We are trying to consolidate into a single direction of providing different kinds of models so that you are easily able to integrate and focus on the value you want to provide to your customers." - Ritesh
Resources
Tools & Platforms:
- Apache Spark – Batch processing engine for ML training jobs, large-scale data processing, and GDPR operations
- Trino – Query engine for BI dashboarding, ETL workflows, and SQL-based data access
- ClickHouse – Columnar database for sub-second query latency and real-time analytics
- Amazon S3 – Data lake storage for Parquet tables and offline data processing
- AWS EKS (Elastic Kubernetes Service) – Kubernetes infrastructure for hosting Spark and Trino
- ClickHouse Cloud – Managed ClickHouse offering used by Lyft
- Hive Table Format – Current table format for organizing Parquet files in S3
- Kubernetes Operators – Infrastructure for managing ClickHouse deployments
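For readers who want to see how these pieces fit together from client code, below is a minimal, hypothetical Python sketch of reaching each engine: Spark over Hive-format Parquet tables in S3, Trino for interactive SQL, and ClickHouse for low-latency aggregations. Hostnames, catalogs, schemas, and table names are placeholders, and this is not Lyft's implementation.

```python
# Minimal sketch of accessing the three engines from Python.
# Hostnames, catalogs, and table names below are placeholders, not Lyft's.
from pyspark.sql import SparkSession
import trino                   # pip install trino
import clickhouse_connect      # pip install clickhouse-connect

# Spark: batch processing over Hive-format Parquet tables in S3.
spark = (
    SparkSession.builder
    .appName("batch-etl-example")
    .enableHiveSupport()       # resolve tables via the Hive metastore
    .getOrCreate()
)
daily_rides = spark.sql(
    "SELECT ride_id, city FROM events.rides WHERE ds = '2024-01-01'"
)
daily_rides.write.mode("overwrite").parquet("s3a://example-bucket/tmp/daily_rides/")

# Trino: interactive SQL over the same Hive catalog for dashboards and ad hoc ETL.
trino_conn = trino.dbapi.connect(
    host="trino.example.internal", port=8080,
    user="analyst", catalog="hive", schema="events",
)
cur = trino_conn.cursor()
cur.execute("SELECT city, count(*) FROM rides WHERE ds = '2024-01-01' GROUP BY city")
print(cur.fetchall())

# ClickHouse: sub-second aggregations for real-time analytics.
ch = clickhouse_connect.get_client(host="clickhouse.example.internal")
print(ch.query("SELECT city, count() FROM rides_rt GROUP BY city LIMIT 10").result_rows)
```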