Thursday, May 30, 2019
Both my wife and I will be speaking this weekend at SQLSaturday Dallas, back-to-back in room 1.102, and both on topics perfect for entry-level IT pros, job seekers, and students. We're honored to be on a SQLSaturday schedule filled with some of the biggest names in the industry, so I hope to see you there! Looking forward to seeing so many #SQLFamily there again.
I'll be presenting SQL Security Principals and Permissions 101, a ground-level introduction to SQL Server authentication, security principals, and even some security principles. We'll go through some code demos showing how stored procedures assist the DBA by abstracting away the permissions necessary for execution, and how you could design a database security model using a role-based, least-permissions pattern. Everyone from DBAs to developers can learn something from this session about how to better secure their databases.
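As a taste of that permissions-abstraction pattern, here's a minimal sketch with hypothetical object names (not the actual session demo): when the same owner holds both a procedure and the table it reads, granting EXECUTE on the procedure lets a user read the table without any direct SELECT permission, via ownership chaining.

```sql
-- Hypothetical demo objects: a table, a user with no table permissions,
-- and a procedure that reads the table on the user's behalf.
CREATE TABLE dbo.SensitiveData (
    ID      int IDENTITY(1,1) PRIMARY KEY,
    Payload varchar(100) NOT NULL
);
CREATE USER AppUser WITHOUT LOGIN;
GO
CREATE PROCEDURE dbo.GetSensitiveData
AS
BEGIN
    SELECT ID, Payload FROM dbo.SensitiveData;
END
GO
GRANT EXECUTE ON dbo.GetSensitiveData TO AppUser;

EXECUTE AS USER = 'AppUser';
-- A direct read fails with "The SELECT permission was denied":
-- SELECT * FROM dbo.SensitiveData;
EXEC dbo.GetSensitiveData;  -- succeeds: EXECUTE is the only permission checked
REVERT;
```

Because the same owner (dbo) owns both objects, SQL Server checks only EXECUTE on the procedure and skips the SELECT check on the table.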
See you at UT Dallas this Saturday, June 1!
Wednesday, May 29, 2019
Untrusted FKs demo + ADS notebooks
Back in 2012 at SQLSaturday Pensacola, I did a lunchtime lightning talk on indexing that was supposed to have a big finale showing how FKs could silently help with performance. It didn't work; it didn't show the performance boost. The reason, it turned out: thanks to poor scripting by SSMS, the FKs were enabled but not trusted. This blog post serves as a long-belated knowledge share of a problem I didn't figure out live, as well as a successful recreation of that big finale.
Fast forward to a couple of weeks ago at the May 2019 Baton Rouge SQL Server User Group meeting, where local fellow SQL pro John Wells gave a presentation to a mixed audience of devs and DBAs on performance tips, one of which was to check up on untrusted FKs. The thought occurred to me to create some labs demonstrating not only how untrusted FKs get accidentally created by well-intentioned devs and DBAs, but how to fix them, and how they can be a silent performance drag. So with all respect to John, to Chris Bell (whom John mentioned in his presentation), and to the many others who have already spoken and written well on the topic, below is a lab demonstrating the issue, the resolution, and the performance impact.
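As a quick preview of the lab (the table and constraint names below are hypothetical stand-ins, not the lab's actual objects), here is a minimal sketch of how a FK typically becomes untrusted, how to find untrusted FKs in a database, and the fix:

```sql
-- A common way FKs become untrusted: disable for a bulk load, then re-enable.
ALTER TABLE dbo.OrderDetail NOCHECK CONSTRAINT FK_OrderDetail_Order;
-- ... bulk insert happens here ...
-- This re-enables the FK but does NOT validate existing rows, so it stays untrusted:
ALTER TABLE dbo.OrderDetail CHECK CONSTRAINT FK_OrderDetail_Order;

-- Find enabled-but-untrusted FKs in the current database:
SELECT OBJECT_NAME(parent_object_id) AS TableName, name AS FKName
FROM sys.foreign_keys
WHERE is_not_trusted = 1 AND is_disabled = 0;

-- The fix: WITH CHECK revalidates the existing rows, restoring trust.
ALTER TABLE dbo.OrderDetail WITH CHECK CHECK CONSTRAINT FK_OrderDetail_Order;
```

Once the constraint is trusted again, the optimizer can rely on it, for example to eliminate a provably redundant join, which was the intended big finale of that lightning talk.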
To further one more goal, I've built the lab as both a T-SQL script and a new Azure Data Studio SQL notebook; either offers a step-by-step introduction and demonstration. New to SQL Notebooks? No worries, they were released generally in March 2019 as a new feature of Azure Data Studio (which can be launched from SSMS 18+), which itself was made GA in September 2018.
A SQL Notebook is another way to combine text instructions with step-by-step, modular SQL statements and their execution in a single file. Notebooks are a popular form of mixed media, combining text, proofs, queries, and their live interactive output, and they are already quite popular tools for data scientists to package methodology, analysis, and live output. SQL Notebooks include kernels for querying in multiple languages, including SQL, PySpark, Spark, and Python, and now, in preview for SQL Server 2019 CTP 3.0, for working with big data clusters. SQL Notebooks could also be used by DBAs and developers as rich, well-documented runbooks for complex deployment or disaster recovery exercises.
SQL Notebooks are easy to learn and use; consider a SQL Notebook for your next detailed query breakdown or summary data analysis. Maybe you'll learn two things at once!
GitHub links to the SparkhoundSQL toolbox for the Untrusted FKs lab: .sql .ipynb
P.S. Check it out: GitHub even parses the notebook JSON correctly to display a (non-interactive) view of the notebook, with its saved resultsets, outside of ADS. Neat!
UPDATE: Refined my terminology a little bit to favor "SQL Notebooks" not "ADS Notebooks"
Monday, April 29, 2019
Four Data Integration Design Questions to Ask
We regularly get requests to move data between custom systems, and even within a single system. I was advising a client on something fairly simple (a collection of tables out of one vendor system to feed another) and thought I'd make a blog post out of the array of questions that always come up.
These questions apply regardless of the strategy for data movement, whether it be:
- SQL Server Integration Services (SSIS) locally or in Azure Integration Runtime (IR)
- Stored procedures
- SQL replication
- Secondary readable Availability Groups
- Azure Data Factory 2.0 (not 1.0, oh goodness, never 1.0)
- Transactional movement featuring message queues or APIs
- Any streaming solution
- ETL or ELT
- Any other kind of transformation I'm forgetting to mention
(There are no correct answers to these questions, of course, but you must be able to determine the answers from the business case.)
1. What is the latency requirement for the changes from the data source(s) to be copied to the destination?
Common answers: Instantly, no longer than 5 min, or 30 min, or nightly.
2. How many rows are expected to change in the source(s) in a given time period?
Common answers: Anywhere from a few rows per month to all or most of the rows in a table every day.
3. What types of data changes are performed in the source(s)?
Is the source data inserted, updated, and/or deleted?
4. Do we have a reliable way to identify "the delta"?
How do we know which rows have changed, including hard deleted rows (vs soft deleted rows)?
Let's dive more into the last question, because this is where the design meets the implementation method. There's a reason we always design tables with an IDENTITY column and some basic auditing fields.
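Here's a minimal sketch of that table pattern, with hypothetical names and one common convention for the auditing fields:

```sql
-- A hypothetical table pattern: IDENTITY key plus basic auditing fields.
CREATE TABLE dbo.Customer (
    CustomerID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerName varchar(100) NOT NULL,
    CreatedDate  datetimeoffset(0) NOT NULL DEFAULT SYSDATETIMEOFFSET(),
    ModifiedDate datetimeoffset(0) NULL  -- relies on the app (or a trigger) to maintain
);
```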
First off, a truncate/insert strategy is not scalable. I have redesigned more of these strategies than I can remember, often because of database developer myopia. A truncate/reinsert strategy, even a bulk insert strategy, will inevitably outgrow the time boundary identified in Question 1. Don't waste your time and resources on such a strategy; you need to identify a way to find out what changed in the data source now.
But what if we don't or can't trust the application to always modify a "ChangeDate" field? Such a field is certainly the easiest way to know whether a row has changed, but what if the original table wasn't designed with one? We should consider whether we can alter the data source(s) to use helpful built-in SQL Server features like Change Tracking (CT), Change Data Capture (CDC), or the more recently introduced Temporal Tables. The latter can provide a reliable, built-in modified date and row history, transparent to applications. All of these strategies are well documented and have easy-to-use labs available.
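For instance, here is a minimal sketch of Change Tracking, with hypothetical database, table, and key names, that yields a reliable delta, including hard deletes, without any application changes:

```sql
-- Enable Change Tracking at the database and table level (names are hypothetical).
ALTER DATABASE SalesDB
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.SalesOrder
    ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);

-- On each pull, ask for everything changed since the last sync version,
-- including hard-deleted rows (SYS_CHANGE_OPERATION = 'D').
DECLARE @last_sync_version bigint = 0;  -- persist this value between pulls

SELECT ct.SalesOrderID,          -- the tracked table's primary key
       ct.SYS_CHANGE_OPERATION,  -- I, U, or D
       ct.SYS_CHANGE_VERSION
FROM CHANGETABLE(CHANGES dbo.SalesOrder, @last_sync_version) AS ct;

-- Store this as the next pull's @last_sync_version:
SELECT CHANGE_TRACKING_CURRENT_VERSION();
```

Note that Change Tracking requires a primary key on the tracked table, which the IDENTITY column mentioned above conveniently provides.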
Each of these solutions is very useful and recommended in its use case, and much preferred over a trigger-based system, which will add complexity and overhead to transactions. A "pull" of recent changes is preferable in most scenarios to a "push" of each change inside the transaction.
Caveats remain, however, and this came up with a recent client: future updates and patches to the database must account for implementations of CT, CDC, or Temporal Tables. The same caveats apply to replication (useful in spots) and database triggers. Don't enable these SQL features without consulting with and advising the maintaining developers on the potential impact and the need for testing.
One more crucial factor, often overlooked as part of Question 4, is intermediate transactions, especially in the case of less-than-instant data movement. If a row changes from status 1 to status 2 to status 3, can we just send over the row's final state with status 3? Or must we apply an insert for status 1, an update for status 2, and then another update for status 3 to the destination? This could be a major problem if the destination has an indirect dependency on evaluating the status changes, for example, to calculate the durations between statuses.
I once designed a data warehouse for tracking the performance of auditors, and we were analyzing the workflow for bottlenecks in a 20-step process. Each of the 20 steps, with its corresponding row state and timestamp, was the whole point of the analysis. This demanded some sort of row-versioning in the data source. Not all change detection strategies work for this, however: Change Tracking, for example, captures only the net change to each row since the last sync, not every intermediate version, so it would not suffice. Know your solutions!
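To illustrate the difference, here is a minimal sketch of CDC on a hypothetical table (the function name follows CDC's default generated naming of schema_table): CDC can return every intermediate change in a log sequence number (LSN) range, which is exactly what a step-duration analysis needs, while Change Tracking would return only the final state.

```sql
-- Enable CDC on the database and a hypothetical table (requires SQL Server Agent).
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'AuditStep',
    @role_name     = NULL;

-- CDC generates cdc.fn_cdc_get_all_changes_dbo_AuditStep for this capture instance.
-- Unlike Change Tracking's net changes, it returns every intermediate version.
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_AuditStep'),
        @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT __$operation,  -- 1 = delete, 2 = insert, 4 = update
       __$start_lsn,
       *
FROM cdc.fn_cdc_get_all_changes_dbo_AuditStep(@from_lsn, @to_lsn, N'all');
```

Temporal Tables would also have worked for that warehouse, since they retain full row history with system-generated timestamps.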
You shouldn't move forward with any data movement design before answering these questions.
Are there any other common questions you'd ask before deciding on a plan for a project like this?