Exclude irrelevant columns and rename any columns that are confusing. When going through and recreating the schema with views of the relevant tables you should also clean up what’s in each table. The first step is to develop guidelines for how you want to clear up the data. To address these issues we need to keep the analyst/business user in mind and make all of the fields easy for them to interpret. Deprecated data flags are often missed by analysts, so this leaves room for error in aggregations.Most analysts are not able to use regex to parse out valuable information from JSON data.Non-descriptive Column names and values will require the analyst to ask an engineer for clarification.Inconsistent naming reduces confidence that the data is correct, and makes it hard to aggregate and group the data.Nulls can produce unexpected results during aggregations.Having multiple Id columns can be confusing.We can review the table we referenced in Why Build a Data Warehouse: This makes it challenging for business users to analyze the data without consulting the engineer. Table names, column names, and even a column’s purpose for being in the table can be confusing to others. Table and column names are typically created by engineers to be used by the application the data comes from. This can sit alongside the cleaned up normalized version of the Data Warehouse. If you want to make it even easier to work with a specific set of data, you can create a wide table (view) that does all the joins. The main thing we want to do to simplify the schema is to exclude tables from the new views that only contain app specific logic and are not useful for analysis. There is no need to move away from 3rd normal form. We create a Single Source of Truth by creating views on top of the existing schema. Today, due to advances in BI tools such as Chartio and Data Warehouse technologies, dimensional modeling is no longer worth the effort. In the past, books recommended using dimensional modeling to reduce the schema complexity, make it easier to run queries, and enhance performance. But this type of schema can be difficult to navigate and many tables will never be used in an analysis. In a Data Lake, the schema reflects in transactional logic of an application and follows best practices (such as a 3rd normal form) so that updating values will not produce errors. In addition, if you want to maintain access to old/unused data sources from your Data Lake in your Data Warehouse, you can label data sources as deprecated or approved to help guide people during their analysis. This will require some renaming and cleaning to accomplish. This ensures the historical records are not lost and creates one location for relevant metrics. If this is not an option, use the data warehouse to create a table which UNIONs the data from both sources. When your company has used multiple tools to track the same type of data, if you can, migrate the data from the previous tools into the latest tool. When an analyst attempts to query for sign-ups, it will be unclear which data source they should use. Moreover, the Google Analytics data might not be as well synchronized between your Hubspot data and your Salesforce data. That means that prior to your switch, the Salesforce data will be empty. Imagine you were tracking sign-ups via Hubspot and after a year you decided to switch to Salesforce. You will also have data from dormant data sources in your Data Lake that is still needed for certain analyses. Data Sourcesīefore you even build a Single Source of Truth, your company will likely have data sources that overlap in terms of what they track. Let’s explore the problems that a Single Source of Truth solves, issues to watch out for, and best practices. Creating a Single Source of Truth requires data engineering effort. This is an obvious thing that any company wants, yet a lot of companies struggle to deliver. The promise of a Single Source of Truth is accuracy across your organization. Metrics need to be derived from the dataĪ Data Warehouse is a database where the data is accurate and is used by everyone in a company when querying data.Unfortunately, this is typical when data has not been cleaned up into a Single Source of Truth. Additionally, it’s demoralizing for everyone and time-consuming to figure out the right answer. When multiple people ask the same question using the same data and get varying answers, it creates doubt in all of the data in your organization.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |