Data Warehousing Series
Introduction
Besides the fact that data professionals don't have direct access to D365
F&O production environments (direct access is available only for
sandboxes), an important constraint imposed by the architecture from the
beginning, there are a few more challenges that need to be addressed
when working with the data.
Case Sensitivity
SQL Server is by default case insensitive, therefore, depending on the channel
through which the data came, values appear in upper case, in lower case, or in
a mixture of both. Even if this isn't an issue within D365, it can become an
issue when the data leave the environment. E.g., Power Query is case sensitive
(while DAX is case insensitive), thus, if a field containing a mix of casings
participates in a join or an aggregation, this will result in unexpected
behavior (e.g., duplicates, ignored records). It's primarily the case of the
Company (aka DataAreaId) field available in most of the important tables.
The ideal solution would be to make sure that the values are correct by
design, however this can't always be enforced. Otherwise, when using the data
outside of D365 F&O, the solution would be to transform all the values to
upper case (or lower case). However, even this step might come too late, e.g.
when the data were already exported to the Azure Data Lake in parquet file
format.
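For illustration, the casing can be harmonized directly in the join, as in the T-SQL sketch below (dbo.SalesTable and dbo.CustTable stand in here for the synchronized copies of the tables; adjust the names and columns to the actual schema):
-- hedged sketch: harmonize the casing of DataAreaId before joining two synchronized tables
SELECT SLS.SalesId
     , CST.AccountNum
FROM dbo.SalesTable SLS
     JOIN dbo.CustTable CST
       ON UPPER(SLS.DataAreaId) = UPPER(CST.DataAreaId)
      AND SLS.CustAccount = CST.AccountNum;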
Unique Keys
In earlier versions a unique record in D365 F&O was usually identified by
RecId and DataAreaId, while later the Partition field was added. This
means that most of the joins need to consider all 3 columns, which adds
some overhead. In some environments there's only one Partition defined (and
thus the field can be ignored), however this is not guaranteed.
As long as developers use SQL there's no issue in using multiple fields in
JOINs, though in Power Query a unique key must be created based on the
respective fields so that the joins are possible. Actually, SQL-based JOINs
would also benefit if each record were identified by a single field.
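A minimal sketch for deriving such a single-field key by concatenating the three columns is shown below (dbo.InventTable is just a placeholder; the separator and the upper-casing rule are arbitrary choices):
-- hedged sketch: derive a surrogate key from Partition, DataAreaId and RecId
SELECT CONCAT(ITM.Partition, '|', UPPER(ITM.DataAreaId), '|', ITM.RecId) AS UniqueKey
     , ITM.ItemId
FROM dbo.InventTable ITM;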
Audit Metadata
Not all tables have fields that designate the date when a record was created
or last modified, or the user who performed the respective action.
The fields can be added manually when setting up the system, however that's
seldom done. This makes it difficult to audit the records, and sometimes it's
also a challenge for reporting, respectively for troubleshooting the
differences between the DWH and the source system. Fortunately, the Export to
Data Lake adds a timestamp reflecting the time when the record was
synchronized, though it can then be used only for the records synchronized
after the initial load.
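For example, a check of what was synchronized over the last day could look like the sketch below (SinkModifiedOn is an assumption for the name of the synchronization timestamp; the actual column name depends on the export mechanism and version):
-- hedged sketch: records synchronized within the last day
-- SinkModifiedOn is assumed; check the actual timestamp column exposed by the export
SELECT ITM.ItemId
     , ITM.SinkModifiedOn
FROM dbo.InventTable ITM
WHERE ITM.SinkModifiedOn >= DATEADD(day, -1, GETUTCDATE());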
Tables vs. Entities
Data are modified in D365 F&O via a collection of entities, which are
nothing but views that encapsulate the business logic, being based on base
tables, on other views, or on a combination of both. The Export to Data
Lake (*) is based on the tables, while Link to Data Lake is based on data
entities.
Using the base tables means that the developer must reengineer the logic from
the views. In some cases it might work to recreate the entities as views in
the DWH environment, though some features might not be supported. It's the
case of serverless and dedicated SQL pools, which support only a subset of the
features available in standard Azure SQL Server.
The developer can try to replicate the logic from the entities, considering
only the logic needed by the business, especially when only a subset of the
functionality available in the entity is used. The newly created views can
thus become more readable and maintainable. On the other side, if the logic in
the entity changes, the changes need to be reflected in the DWH views as well.
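As an illustration, such a simplified replica could be a view that keeps only the joins and columns needed downstream, as in the sketch below (the object and column names are assumptions based on the standard product tables; the actual entity typically contains considerably more logic):
-- hedged sketch: simplified replica of a products entity as a DWH view
-- dbo.EcoResProduct and dbo.EcoResProductTranslation stand in for the synchronized tables
CREATE OR ALTER VIEW dwh.vProducts
AS
SELECT PRD.RecId
     , PRD.DisplayProductNumber AS ProductNumber
     , TRL.Name AS ProductName
FROM dbo.EcoResProduct PRD
     LEFT JOIN dbo.EcoResProductTranslation TRL
       ON PRD.RecId = TRL.Product; -- language filter omitted for brevity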
Using the entity-based data makes sure that the data are consistent between
environments. Unfortunately, Microsoft found out that it isn't so easy to
synchronize the data at entity level. Moreover, there are multiple entities
based on the same table that reflect only a subset of the columns or rows.
Thus, to cover all the fields from a base table, one might be forced to
synchronize multiple entities, leading to data duplication.
In theory, both mechanisms can be used within the same environment, even if
this approach goes against the single source of truth principle when data are
duplicated.
Data Validation in the Data Lake
One scenario in which both sources are useful is when validating whether the
synchronization mechanism worked as expected. Thus, one can compare the number
of records and check whether there are differences that can't be mitigated.
However, does it make sense to "duplicate" database objects only for this
purpose?
Ideally, validating whether a record was synchronized should be done in the
source environment (e.g. via a timestamp). That's difficult to achieve,
especially when there's no direct access to the source database (as is the
case for Production databases). Fortunately, Dataverse provides this
functionality, even if it might not be bulletproof.
In extremis, the most reliable approach is to copy the production environment
to a sandbox and do a count of records for each table, using the time when the
refresh occurred as the baseline for comparison.
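A minimal sketch for building such a baseline on the sandbox copy is shown below (it uses sys.dm_db_partition_stats for speed; exact figures would require a COUNT(*) per table):
-- hedged sketch: approximate record counts per table, usable as a comparison baseline
SELECT SCH.name AS SchemaName
     , TBL.name AS TableName
     , SUM(PST.row_count) AS NumberRecords
FROM sys.dm_db_partition_stats PST
     JOIN sys.tables TBL
       ON PST.object_id = TBL.object_id
     JOIN sys.schemas SCH
       ON TBL.schema_id = SCH.schema_id
WHERE PST.index_id IN (0, 1)
GROUP BY SCH.name
       , TBL.name
ORDER BY SchemaName
       , TableName;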
Base Enum Values
The lists of values that don't have their own tables are managed within the
application as Base Enums and, naturally, only the numeric values are saved
to the database. Even if this is practical for the application, it's a
nightmare for the people using the data exported from the database, as the
codes need to be converted to meaningful values. Some of the mappings between
the codes and values are documented in two system tables, and even in old
language-based documentation, though both sources are far from complete. As an
alternative, one can try to discover the values in the system.
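For example, the values stored in the two system tables mentioned further below can be listed with a query like the following (the join and column names are assumptions and may differ between versions):
-- hedged sketch: discover the enum values stored in the system tables
-- the column names (Id, EnumId, EnumValue, Name) are assumptions; check the actual schema
SELECT EIT.Name AS EnumName
     , EVT.EnumValue
     , EVT.Name AS ValueName
FROM dbo.EnumValueTable EVT
     JOIN dbo.EnumIdTable EIT
       ON EVT.EnumId = EIT.Id
ORDER BY EnumName
       , EVT.EnumValue;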
Unfortunately, the mappings need to be repeated when the Enum-based attribute
is used in multiple places. One can reduce the duplication of mappings by
encapsulating the logic into a view (aka "base view") that can be reused
accordingly (see the logic for
TDM.vEcoResProduct).
Even if the values for many of the Enums are stored in the EnumValueTable
table, with the Enum's name available in the EnumIdTable table, it's not a
good idea to retrieve the values via a JOIN in the business logic. This would
complicate the business logic unnecessarily. A CASE expression is more
efficient, even if occasionally more difficult to maintain. Unfortunately,
there's no timestamp to identify which values were added lately.
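As an illustration, a hard-coded mapping for a sales order status could look like the sketch below (the codes and labels are only indicative and need to be validated against the actual enum definition):
-- hedged sketch: map a Base Enum code to a label via a CASE expression
-- the codes/labels are illustrative; validate them against the actual SalesStatus enum
SELECT SLS.SalesId
     , CASE SLS.SalesStatus
         WHEN 1 THEN 'Open order'
         WHEN 2 THEN 'Delivered'
         WHEN 3 THEN 'Invoiced'
         WHEN 4 THEN 'Canceled'
         ELSE 'Unknown'
       END AS SalesStatusName
FROM dbo.SalesTable SLS;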
Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export
to Data Lake to Synapse Link. Microsoft advises new customers to use Synapse
Link.