Data catalogs store information about enterprise data. They organize ownership, definitions, and relationships for the data available in a company's systems. They enable analysts to find the information they need for a project quickly while empowering compliance teams to effectively manage access to sensitive data.
Why do companies use data catalogs?
Catalogs make it simple for teams to navigate and control data assets - empowering users to:
- More efficiently find data for a dashboard.
- Identify and track sensitive information for regulatory purposes.
- Control who should have access to data via policy enforcement and role-based access control.
What information do data catalogs store?
Many types of metadata can be stored, managed, and discovered through a data catalog:
- Metadata on dashboards - In large organizations, the team that creates a dashboard is often not the team that will be using the dashboard to make business decisions. It is important to organize metadata on the charts, the metrics, and the nuances of each dashboard, so users can quickly identify how they should be interpreting data to make business decisions.
- Metadata on internal business systems - In many scenarios, analysts start their journey by identifying a business system, and then discovering useful metrics within that system to power insights. By storing metadata on the data that resides in each system (a SaaS tool, a database, or a data lake), teams can more effectively uncover valuable insights.
- Metadata on data sets - Data sets describe a specific entity within the enterprise. A ‘deals’ data set from a customer relationship management (CRM) tool would store information on every deal in the sales pipeline. By organizing metadata on each data set, users can better understand what each entity represents to the business, how entities relate to each other, and how best to drill into the particular attributes of each entity to generate novel insights.
- Metadata on particular attributes - In the ‘deals’ data set above, there are likely ten to twenty important attributes for each deal (owner, primary contact, deal size, etc.). Each attribute is represented as a column, a field, or a property in data systems. This is typically the most granular level of metadata organized within data catalogs. For example, a ‘contact_email_address’ field in the ‘deals’ data set could have a description, an owner, and a sensitivity level.
At each level of granularity, companies will track metadata on data lineage (i.e. what system each piece of data came from), when data was last updated, as well as granular relationships that show how entities relate to one another.
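As a concrete illustration, an attribute-level catalog entry like the ‘contact_email_address’ example above could be modeled as a small record that also carries lineage and freshness. The field names, sensitivity labels, and values here are hypothetical, not taken from any particular catalog product.

```python
from dataclasses import dataclass, field

@dataclass
class AttributeMetadata:
    """Catalog entry for a single attribute (a column, field, or property)."""
    name: str
    description: str
    owner: str
    sensitivity: str                 # e.g. "public", "internal", "pii"
    source_system: str               # lineage: which system the data came from
    last_updated: str                # when the data was last refreshed
    related_entities: list = field(default_factory=list)  # links to other entities

# Illustrative entry for the 'deals' data set example.
entry = AttributeMetadata(
    name="contact_email_address",
    description="Primary contact's email address for the deal",
    owner="sales-ops@example.com",
    sensitivity="pii",
    source_system="crm",
    last_updated="2023-01-15T09:30:00Z",
    related_entities=["contacts.email"],
)
```

Keeping lineage (`source_system`) and freshness (`last_updated`) alongside ownership and sensitivity is what lets a single catalog record answer the questions described at each level of granularity.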
How are data catalogs populated with information?
Data catalogs are unbelievably powerful tools when they are kept up to date. However, if information is not populated, becomes stale, or is not treated as the source of truth within the enterprise, the catalog can provide limited benefit to the organization.
Many cataloguing platforms can automatically generate certain metadata from business systems. For example, a cataloguing tool might sample a subset of the data in a system and feed it into artificial intelligence / machine learning (AI / ML) models to determine whether the data is sensitive (does it look like personally identifiable information?) and what it represents (does it look like an email address or a company domain?). With automation and effective logic, cataloguing tools can generate and manage certain metadata with little to no human input.
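A minimal sketch of that sample-then-classify flow, using a regex heuristic as a stand-in for the AI / ML models a real catalog would apply (function names and thresholds here are illustrative assumptions):

```python
import random
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def classify_column(values, sample_size=100, threshold=0.8):
    """Sample a subset of a column's values and guess its semantic type.

    A production catalog would feed the sample into trained AI/ML models;
    the regex check below simply illustrates the sample-then-classify flow.
    """
    if not values:
        return {"semantic_type": "unknown", "sensitive": False}
    sample = random.sample(values, min(sample_size, len(values)))
    hits = sum(1 for v in sample if EMAIL_RE.match(str(v)))
    if hits / len(sample) >= threshold:
        return {"semantic_type": "email_address", "sensitive": True}
    return {"semantic_type": "unknown", "sensitive": False}

print(classify_column(["a@x.com", "b@y.org", "c@z.io"]))
# {'semantic_type': 'email_address', 'sensitive': True}
```

Sampling keeps the cost low on large tables, while the threshold guards against a few stray values mislabeling an entire column.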
However, there will always be aspects of the cataloguing process that are manual. Users will continue to create new custom attributes, or build valuable data sets that never existed before. Companies need to have a strong culture of data governance - combined with policies, procedures, and controls - to ensure changes are represented in the data catalog for the benefit of the broader organization.
The complexity of effective metadata management
Managing and organizing metadata throughout the entire modern data stack is critical to scalability of a data-driven enterprise. However, with the number of moving parts in the modern data stack, this can quickly become overwhelming. Here is a quick reminder of how complex data workflows can become:
- Event-level data attributes are defined and organized using a tracking plan.
- Event-level data is collected from websites and mobile apps by a customer data platform (CDP) or product analytics tool.
- Data is collected in business applications - either through manual input or automated interactions.
- Extract, load, transform (ELT) tools extract data from event sources and business applications and sync the data into a data warehouse.
- The data is stored for processing in a data warehouse.
- A transformation platform turns raw data into insights in the warehouse.
- A policy enforcement layer applies role-based access control to datasets, tables, and fields.
- Data is turned into dashboards through a visualization solution.
- 2nd party and 3rd party users can consume certain data sets directly via marketplaces with secure sharing capabilities.
- Reverse ETL solutions activate data back into business applications for process automation.
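To make one step in this flow concrete, the policy enforcement layer's role-based access control can be sketched as a simple mapping from roles to the sensitivity levels they may read. The role names and levels below are hypothetical, not from any specific product.

```python
# Hypothetical mapping of roles to the sensitivity levels they may read.
ROLE_POLICY = {
    "analyst": {"public", "internal"},
    "compliance": {"public", "internal", "pii"},
}

def can_read(role: str, field_sensitivity: str) -> bool:
    """True if the role's policy permits reading a field of this sensitivity."""
    return field_sensitivity in ROLE_POLICY.get(role, set())

print(can_read("analyst", "pii"))  # False: analysts cannot read PII fields
```

Real enforcement layers apply this kind of check at the dataset, table, and field level, which is exactly where attribute-level catalog metadata such as sensitivity labels pays off.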
Given such complexity, many data cataloguing solutions focus on three aspects of the workflow above:
- When data is at rest in a data warehouse (or a data lake, or a database), the catalog can become an effective source of truth for metadata.
- When data is turned into dashboards, catalogs can annotate the key metrics and insights that have been created.
- When data is shared with 2nd and 3rd parties, catalogs can offer a simple way of organizing and sharing metadata on what information is available to be consumed or purchased.
Without a clear way to integrate metadata throughout the remaining components of the modern data stack, finding and leveraging data can rapidly become unwieldy.
We believe ELT and Reverse ETL solutions have an important role to play in metadata management
ELT solutions produce highly valuable metadata. Not only do ELT solutions have direct connections into event collection tools and business applications, but they already schematize and annotate information for analytics in the warehouse. ELT tools are well positioned to sync valuable metadata automatically to data catalog solutions, reducing manual effort and helping to integrate metadata throughout the entire modern data stack.
Reverse ETL solutions must be able to consume and deliver metadata. For Reverse ETL solutions, it is critical not only to sync raw attributes from the warehouse back into operational tools, but also to carry key metadata along with them. Just as visualization tools need to respect policy enforcement and role-based access control, Reverse ETL solutions need to ensure data is discoverable and accessible, yet controlled at the same time.
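One way a Reverse ETL tool could honor catalog metadata is to consult attribute sensitivity before planning a sync, dropping fields the destination is not cleared to receive. This is an illustrative sketch under assumed names, not any vendor's API.

```python
def plan_sync(rows, catalog, allowed=frozenset({"public", "internal"})):
    """Drop fields whose catalog sensitivity isn't allowed at the destination.

    `catalog` maps field name -> sensitivity label, mirroring the attribute
    metadata a Reverse ETL tool could pull from a data catalog (illustrative).
    """
    permitted = {f for f, s in catalog.items() if s in allowed}
    return [{k: v for k, v in row.items() if k in permitted} for row in rows]

# Example: the PII email field is stripped before syncing to an operational tool.
catalog = {"deal_size": "internal", "contact_email_address": "pii"}
rows = [{"deal_size": 50000, "contact_email_address": "a@x.com"}]
print(plan_sync(rows, catalog))  # [{'deal_size': 50000}]
```

Driving the sync plan from catalog metadata, rather than hard-coded field lists, is what keeps activated data both accessible and controlled.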
Want to learn more? Book time for a discussion or a demo directly on my calendar.
Enjoy our posts? Subscribe to our newsletter to receive content directly in your inbox.