AI Won't Replace Engineers, But This Framework Will Change How They Build with Rohit Girme

Scaling AI from proof-of-concept to production requires more than just deploying models; it demands robust evaluation frameworks, human oversight, and a fundamental shift in how engineering teams approach development.

In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Rohit Girme, Staff Software Engineer at Airbnb, to explore how Airbnb built a Gen AI evaluation platform to assess LLM outputs across product surfaces, from customer support bots to search and booking experiences. Rohit shares insights into Airbnb's infrastructure choices, evaluation workflows, and lessons learned about leveraging AI tools while maintaining human orchestration.

What You'll Learn:

- How to architect a multi-layer Gen AI evaluation platform using Python, vLLM, Kubernetes, and DAG-based workflows to systematically test LLM outputs in production

- Why splitting monolithic "virtual judges" into specialized LLM-powered metrics (content relevance, hallucination detection, policy adherence) dramatically improves evaluation accuracy and debugging

- The critical distinction between real-time evaluation (lightweight, sub-second latency) and offline evaluation (comprehensive, human-in-the-loop) and how to route outputs accordingly

- How to shift from traditional software engineering (deterministic, rule-based testing) to probabilistic AI evaluation where you validate outputs against golden datasets and human judgment benchmarks

- The framework for breaking down problems into smaller chunks and using AI tools as collaborators rather than end-to-end problem solvers—critical when working with codebases at massive scale

- Why documentation becomes infrastructure in an AI-driven workflow: LLMs need comprehensive, well-formatted docs to scale tribal knowledge across entire organizations

- The hard truth about AI and scaling: zero-to-one innovation is now commoditized, but one-to-n execution (the scaling part) still demands human judgment, orchestration, and product sense

- How to measure AI tool adoption beyond token usage: instrument your development workflow to capture whether LLM suggestions actually made it into shipped code and added real value
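
To make the "specialized metrics" and routing ideas above concrete, here is a minimal Python sketch. It is not Airbnb's implementation: every name in it (content_relevance, policy_adherence, evaluate, the 0.5 threshold, the "offline_review" route) is a hypothetical chosen for illustration, and the toy heuristics stand in for what would be LLM-powered judge calls in a real platform. The assumption is that lightweight checks run inline and anything that fails one of them is flagged for the slower, human-in-the-loop offline path.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical metric signature: (prompt, response) -> score in [0, 1].
Metric = Callable[[str, str], float]


def content_relevance(prompt: str, response: str) -> float:
    """Toy stand-in for an LLM judge: share of prompt words echoed in the response."""
    prompt_words = set(prompt.lower().split())
    if not prompt_words:
        return 0.0
    response_words = set(response.lower().split())
    return len(prompt_words & response_words) / len(prompt_words)


def policy_adherence(prompt: str, response: str) -> float:
    """Toy stand-in for an LLM judge: fail any response containing a banned phrase."""
    banned_phrases = ("guaranteed refund",)  # assumed example policy
    return 0.0 if any(p in response.lower() for p in banned_phrases) else 1.0


@dataclass
class EvalResult:
    scores: Dict[str, float]  # one score per specialized metric
    route: str                # "pass" or "offline_review"


def evaluate(prompt: str, response: str, metrics: Dict[str, Metric],
             threshold: float = 0.5) -> EvalResult:
    """Run each metric separately so a failure can be traced to one named check."""
    scores = {name: fn(prompt, response) for name, fn in metrics.items()}
    passed = all(score >= threshold for score in scores.values())
    return EvalResult(scores, "pass" if passed else "offline_review")


METRICS: Dict[str, Metric] = {
    "content_relevance": content_relevance,
    "policy_adherence": policy_adherence,
}

good = evaluate("cancel my booking", "I can cancel your booking today", METRICS)
bad = evaluate("cancel my booking", "You get a guaranteed refund", METRICS)
print(good.route, good.scores)
print(bad.route, bad.scores)
```

Keeping one score per metric, rather than a single verdict from one monolithic judge, is what makes debugging easier: the result shows which check failed, not just that something did.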

About the Guest

Rohit Girme is a Staff Software Engineer at Airbnb, where he has spent the last seven and a half years building infrastructure and platforms at scale. With deep expertise in search and machine learning infrastructure, Rohit leads efforts in Gen AI evaluation and has pioneered Airbnb's approach to ensuring AI-powered features work reliably in production. In this episode, Rohit shares practical insights on building evaluation platforms for large language models, orchestrating AI in product workflows, and leveraging AI tools effectively in software development. His work on integrating LLMs into customer-facing products while maintaining quality and performance offers actionable strategies for engineering teams navigating the rapid adoption of AI, which makes this conversation essential for data engineers and platform builders looking to scale AI responsibly.

Quotes

"Zero to one is easy now, but the one to n, which is a scaling part, I think we still haven't figured that out. You still need humans for that." - Rohit Girme

"With AI, it's a black box to us as well. We don't know how it's working underneath, so we have to figure out another way to evaluate the surface." - Rohit Girme

"Humans should be the orchestrators of these tools and not just hand off everything to these tools." - Rohit Girme

"If we hand off everything to the LLM, it will make a lot of assumptions because context is limited, and it doesn't know the code enough." - Rohit Girme

"Documentation has become even more relevant because now LLMs need to know everything so everyone can scale up." - Rohit Girme

"Measuring productivity in LLMs is not just about how many tokens people are using—you need to figure out if they're actually building something on top." - Rohit Girme

"Internet democratized information, and I think with LLMs, it's capability that would be democratized. If you have a good idea, you can build it very quickly." - Rohit Girme

"There's always going to be blind spots for every person, but with AI, it'll become even faster because you have this very short cycle of talking to the AI instead of talking to five humans." - Rohit Girme

"Shipping products or shipping features would become even faster—where earlier it took weeks or months, now it will be days." - Rohit Girme

"I have supercharged my workflow day to day either at work or at home with access to information that's so easy to get." - Rohit Girme

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here: https://www.fame.so/follow-rate-review

Resources

LinkedIn Profiles:

Rohit Girme's LinkedIn: https://www.linkedin.com/in/rohitgirme/
Benjamin Wagner's LinkedIn: https://www.linkedin.com/in/wagjamin

Company Websites:

Airbnb: airbnb.com
Firebolt: firebolt.io

Tools & Platforms:

vLLM – Open-source inference and serving engine for hosting and running LLMs
Kubernetes – Container orchestration platform used for serving infrastructure
Apache Airflow – DAG-based workflow orchestration tool (originated at Airbnb)
GitHub Copilot – AI-powered code completion tool for software development
Claude – LLM tool referenced for code generation and development assistance

Cloud Services:

Azure – Hosted LLM services used at Airbnb
AWS – Hosted LLM services used at Airbnb