Monday, April 29, 2019

Four Data Integration Design Questions to Ask

We get requests to move data between custom systems regularly, even within systems. I was advising a client on something fairly simple (a collection of tables out of one vendor system to feed another), and I thought I'd make a blog post out of the array of questions that always come up.

Regardless of the strategy for data movement, whether it be:
  • SQL Server Integration Services (SSIS) locally or in Azure Integration Runtime (IR)
  • Stored procedures
  • SQL replication
  • Secondary readable Availability Groups
  • Azure Data Factory 2.0 (not 1.0, oh goodness, never 1.0)
  • Transactional movement featuring message queues or APIs
  • Any streaming solution
  • ETL or ELT
  • Any other kind of transformation I'm forgetting to mention
The following questions should be asked before designing a data movement plan.

(There are no correct answers to these questions of course, but you must be able to determine the answers from the business case.)

1. What is the latency requirement for the changes from the data source(s) to be copied to the destination?
Common answers: Instantly, no longer than 5 min, or 30 min, or nightly.

2. How many rows are expected to change in the source(s) in a given time period? 
Common answers: Anywhere from a few rows per month to all or most of the rows in a table every day.

3. What types of data changes are performed in the source(s)? 
Is the source data inserted, updated, and/or deleted? 

4. Do we have a reliable way to identify "the delta"? 
How do we know which rows have changed, including hard deleted rows (vs soft deleted rows)?
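To make the hard-delete versus soft-delete distinction concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a recommended production method: the `classify_changes` function, the `is_deleted` flag, and the sample rows are all hypothetical, and a full key comparison like this scans both sides, so it only suits small tables.

```python
def classify_changes(source_rows, dest_rows):
    """Compare two {key: row} snapshots and classify the differences.

    Hypothetical helper for illustration only.
    """
    source_keys = set(source_rows)
    dest_keys = set(dest_rows)
    # Keys present only in the source are new rows.
    inserted = sorted(source_keys - dest_keys)
    # Keys present only in the destination were hard deleted in the source.
    hard_deleted = sorted(dest_keys - source_keys)
    # Keys on both sides whose contents differ are updates.
    # A soft delete (a flag flipped on the row) shows up here.
    updated = sorted(
        key for key in source_keys & dest_keys
        if source_rows[key] != dest_rows[key]
    )
    return {"inserted": inserted, "updated": updated, "hard_deleted": hard_deleted}


# Assumed sample data: row 2 was soft deleted, row 3 was hard deleted,
# and row 4 is new in the source.
source = {
    1: {"name": "Ann", "is_deleted": False},
    2: {"name": "Bob", "is_deleted": True},
    4: {"name": "Dee", "is_deleted": False},
}
destination = {
    1: {"name": "Ann", "is_deleted": False},
    2: {"name": "Bob", "is_deleted": False},
    3: {"name": "Cy", "is_deleted": False},
}

delta = classify_changes(source, destination)
print(delta)
```

The soft-deleted row is still in the source, so it arrives as an ordinary update. The hard-deleted row leaves no trace in the source, and it is only found by noticing that its key is missing.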

Let's dive more into the last question, because this is where the design meets the implementation method. There's a reason we always design tables with an IDENTITY column and some basic auditing fields.

First off, a truncate/insert strategy is not scalable. I have redesigned more of these strategies than I can remember, often because of database developer myopia. A truncate/reinsert strategy, even a bulk insert strategy, will inevitably outgrow the time boundary identified in Question 1. Don't waste your time and resources on such a strategy. You need to identify a way to find out what changed in the data source now.

But what if we don't or can't trust the application to always modify a "ChangeDate"? This is certainly the easiest way to know if the row has changed, but what if the original table wasn't designed with such a field? We should consider whether we can alter the data source(s) with useful, built-in SQL Server features like Change Tracking (CT), Change Data Capture (CDC), or a more recently introduced feature called Temporal Tables. The latter can provide a reliable, built-in modified date and row history, transparent to applications. All of these strategies are well documented and have easy-to-use labs available.
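For reference, the "ChangeDate" approach amounts to a high-watermark pull. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for SQL Server, and the `Orders` table, its columns, and the `pull_changes` helper are assumptions made up for this example. It also shows the weakness of the approach: a hard-deleted row never appears in the pull.

```python
import sqlite3

# In-memory SQLite database standing in for the real source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, Status TEXT, ChangeDate TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [
        (1, "shipped", "2019-04-28T09:00:00"),
        (2, "open", "2019-04-29T08:15:00"),
        (3, "open", "2019-04-29T10:30:00"),
    ],
)


def pull_changes(conn, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark.

    ISO-8601 strings sort chronologically, so a text comparison works here.
    """
    rows = conn.execute(
        "SELECT OrderID, Status, ChangeDate FROM Orders "
        "WHERE ChangeDate > ? ORDER BY ChangeDate",
        (last_watermark,),
    ).fetchall()
    # Keep the old watermark when nothing changed.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


# First pull: everything changed since midnight on April 29.
changed, watermark = pull_changes(conn, "2019-04-29T00:00:00")
print(changed, watermark)

# A hard delete leaves nothing behind for the next pull to find.
conn.execute("DELETE FROM Orders WHERE OrderID = 2")
changed_after_delete, watermark_after_delete = pull_changes(conn, watermark)
print(changed_after_delete, watermark_after_delete)
```

The second pull returns no rows even though the source changed, which is exactly the gap that the built-in change detection features are meant to close.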

Each of these solutions is very useful and recommended in its use case, and much preferred over a trigger-based system, which adds complexity and overhead to transactions. A "pull" of recent changes is much preferred for most scenarios over a "push" of each change inside the transaction.

Caveats remain, however (and this came up with a recent client): future updates and patches to the databases must account for any implementation of CT, CDC, or Temporal Tables. The same caveats apply to replication (useful in spots) and database triggers. Don't enable these SQL features without consulting with and advising the maintaining developers on the potential impact and need for testing.

One more crucial factor often overlooked as part of Question 4 is the handling of intermediate transactions, especially in the case of less-than-instant data movement. If a row changes from status 1, to status 2, to status 3, can we just send over the row state with status 3? Or must we apply an insert for status 1, an update for status 2, and then another update for status 3 to the destination? This could be a major problem if the destination has an indirect dependency on evaluating the status changes; for example, to calculate the durations between statuses.
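A small Python sketch shows what is lost when only the final state is sent. The timestamps, the `(status, changed_at)` tuple layout, and both helper functions are hypothetical and exist only to illustrate the point.

```python
from datetime import datetime

# Assumed version history for one row: (status, time the status was set).
history = [
    (1, datetime(2019, 4, 29, 9, 0)),
    (2, datetime(2019, 4, 29, 9, 20)),
    (3, datetime(2019, 4, 29, 10, 5)),
]


def latest_state(versions):
    """Net change only: the most recent version of the row."""
    return max(versions, key=lambda version: version[1])


def durations_between_statuses(versions):
    """Minutes spent in each status before moving to the next one."""
    ordered = sorted(versions, key=lambda version: version[1])
    return [
        (prev[0], nxt[0], (nxt[1] - prev[1]).total_seconds() / 60)
        for prev, nxt in zip(ordered, ordered[1:])
    ]


final = latest_state(history)
durations = durations_between_statuses(history)
print(final)      # status 3 with its timestamp
print(durations)  # one entry per status transition

# With only the final state, there is nothing left to measure.
print(durations_between_statuses([final]))
```

With the full history the destination can compute 20 minutes in status 1 and 45 minutes in status 2. With only the final row, the duration list is empty.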

I once designed a data warehouse for tracking the performance of auditors, and we were analyzing the workflow for the bottlenecks in a 20-step process. Each of the 20 steps and its corresponding row state and time stamp were the whole point of the analysis. This demanded some sort of row-versioning in the data source. Not all change detection strategies work for this, however. Change Tracking, for example, would not suffice. Know your solutions!

You shouldn't move forward with any data movement design before answering these questions.

Are there any other common questions you'd ask before deciding on a plan for a project like this?