The Data Engineering Show
Revolutionizing Data Governance with DataStrato’s Unified Open Source Approach
April 8, 2025
In this episode of The Data Engineering Show, the bros sit with Lisa Cao, Product Manager at DataStrato, to explore data catalogs and Apache Gravitino, a unified metadata lake used to manage access and perform data governance for all data sources. They discuss data catalogs and how they refine the data management process.

What You’ll Learn:


If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here [insert link].

Lisa Cao is a Product Manager at DataStrato, specializing in AI/ML product partnerships and developer relations. With deep expertise in data catalog technologies and open-source ecosystems, she plays a key role in developing Apache Gravitino, an ASF incubating project that provides a unified governance and security layer for diverse data systems. Her work in developing extensible catalog frameworks has helped organizations manage complex data environments across multiple platforms.

Episode Highlights:

Apache Gravitino is a meta-catalog that serves as a unified data governance and security layer for managing different data systems. Lisa shares that Gravitino was the first to release an Iceberg REST catalog and ended up open sourcing it for the general community to use. As time passed, Polaris and Unity Catalog were also announced as open source. She highlights that although Gravitino, Polaris, and Unity Catalog are very similar, Gravitino differs in that it can support multiple catalogs.

One of the interesting things about Gravitino is that it offers more than just a data catalog: it also offers model catalogs, which are a first step toward merging the two worlds of data catalogs and AI/ML catalogs. Lisa shares the goal of effective model management: creating a system that can store and manage different types of models, track changes to those models, and control access to them.

Think of Gravitino as a “traffic cop” that helps to manage and secure data from multiple sources. It is crucial to have a system that provides unified access control across all data sources, allowing teams to manage access and data governance so that ML teams don't have to worry about access. Lisa says that Apache Gravitino is the system that makes data accessible to different teams and users while making sure that it is secure and governed appropriately. 

Every query engine has its own way of managing data, which makes it difficult to switch between engines - you have to reconfigure everything. Lisa highlights that Gravitino solves the problem by providing a single layer of data governance that works across multiple query engines.


Lisa talks about how fast the data engineering space is moving and shares some insights on catching up:


She warns against social media hype, which can amplify the messaging around new tools and make it seem as though everyone is using them when, in reality, actual adoption is hard to see.

Episode Resources:



The Data Engineering Show is handcrafted by our friends over at: fame.so

Previous guests include: Joseph Machado of LinkedIn, Matthew Weingarten of Disney, Joe Reis and Matt Housley, authors of Fundamentals of Data Engineering, Zach Wilson of Eczachly Inc, Megan Lieu of Deepnote, Erik Heintare of Bolt, Lior Solomon of Vimeo, Krishna Naidu of Canva, Mike Cohen of Substack, Jens Larsson of Ark, Gunnar Tangring of Klarna, Yoav Shmaria of Similarweb and Xiaoxu Gao of Adyen.

Check out our three most downloaded episodes: