Monday, November 1, 2010

How Big & Full Is Your Data Warehouse?

I ask this question at every client that hires me to enhance their Microsoft SQL Server Business Intelligence solution. It seems like a pretty easy and straight forward question right?  You would be surprised at how many companies or BI teams within companies struggle to answer this question easily or have up to date information.  The answer usually requires a DBA to find, copy, or manually create an excel sheet containing the data file information, then use formulas to sum it all up.  There is nothing wrong with this approach but it is time consuming.

There are usually several data marts or a warehouse that is physically split up over many different databases, and it can take a few days to figure this out if there are many data files.  Another question I ask is how full are your data marts or warehouse? You would be surprised at how many companies or BI teams struggle with this information or just don’t know that it is important.  This info can help with capacity planning and performance as well.  If your data files have to grow a few times a week to meet increasing data needs you are taking a performance hit each time the data file has to grow.

To simplify these questions, I wrote a script a few years ago that make the answers easy to attain. Disclaimer: I gathered some of the guts of the #datainfo T-SQL from a SQL Server forum a few years ago. Credit goes to the author of it as it got me thinking to create this script and his / her thought process was very cool.

I hope this script helps those out that struggle with these questions. This script can obviously be used to answer any database space question or modified to suit your needs rather easily.

This script will gather all the database sizes and log file sizes on a SQL server rolled up.

The results look like (click to make larger):

Thursday, October 21, 2010

SSIS Server Architecture Depends on Package Design

A common question that my clients ask when they purchase hardware for the ETL Layer of a Business Intelligence architecture using SQL Server Integration Services (SSIS) is usually, “does this server need to be attached to SAN disk or a dedicated external disk, or will the server’s local disk suffice?”   The answer is “it depends” and I will explain why.

Disclaimer: I usually recommend a dedicated SSIS server for exclusively running ETL packages for larger environments, or for smaller environments that have the potential to become large. For throughput, recovery, etc. SAN disk or a dedicated external disk is usually a better option than local server disk. With budgets being tight in this economy it is not always an option so I won’t make this a “SAN or not” article, because “it also depends”.I find that there are two schools of thought regarding SSIS package design in the last few years.  The first perspective is usually from the Database Engineers – DBAs who think that anything related to data should use execute SQL Tasks for the most part, utilizing the database engine with optimized transact SQL (T-SQL).  The second perspective is from the SSIS Developer that believes packages should use lookups, data flow tasks, script tasks, raw data files, etc., and execute sql tasks should be used as little as possible.

Another Disclaimer: I can relate to both perspectives as I have filled both roles on BI projects and this is a debate that I have seen at more than one client and am sure others have this discussion as well, so why not blog about it and hopefully it can help.

Drum roll please – the answer is both the Database Engineers – DBAs and SSIS Developers are correct.  From the SSIS Developers point of view everything is done and contained within the SSIS packages and local disk will suffice since most operations are done in memory so there is no need for SAN or external dedicated disk.  From the Database Engineers – DBAs point of view, the SSIS server can be configured to have the database engine present with attached SAN or external dedicated disk and use staging database(s) that run optimized T-SQL within execute SQL tasks.

The point is that it is not a clear cut question and answer and it really depends on a lot of factors. If you are the person making the architectural decisions for the SSIS server, you should understand how the packages will be designed first to answer this question correctly.

Wednesday, October 13, 2010

Divide and Conquer Reporting By Scaling Out SQL Server with Replication

This is my first blog post and surely won’t be my last. I have talked about creating a SQL BI blog for a few years and over the last few weekends I finally gathered my thoughts and documents I wrote over the years to share, so here goes the RTM version.  If there are any errors, they will be fixed in SP1 :-)

A common situation I encounter at various clients is that they perform heavy reporting out of the same transactional databases that power revenue driving sites.  This usually hinders the performance and decreases revenue generation, not a good situation.  The cost of additional hardware or additional SQL Server licenses is sometimes hard to justify, but if its possible to show that slow transactional throughput hinders revenue generation, or if the lack of timely reporting is limiting visibility for additional revenue opportunities, it might be a good place to start.

So now that you have wowed the folks with budgets that can buy servers and SQL Server licenses with this amazing information, how do you take the data and have it duplicated to your new SQL Server in real time so reporting can be performed there? Real time was the requirement for these particular clients The answer is simple, transactional replication.  I am using transactional replication in this example because the last two times I provided this architectural solution to my clients, one was still using SQL 2000 and the other only needed a few tables of a very large SQL 2005 database. You can use snapshot replication, database mirroring for SQL 2005 or later, or even log shipping. It all depends on the requirements but that won’t be talked about in this example and you should research each option for your situation.

To implement SQL Server Replication you need to setup the database server that you would like to replicate as a publisher, you need a distributor this manages the transactions in terms of reading, storing and sending, and you also need a subscriber this contains your duplicated database(s) you will use for future reporting. 

Clients that I have provided consulting services for struggled with the next step of the replication architecture on more than one occasion; that is where to configure the distributor and subscriber. This is where it gets interesting, you may or may not need a separate server for each. As a general rule of thumb for smaller publisher database implementations small in size, small number of concurrent users, small number of transactions, I usually place the distributor and subscriber on one server separate from the publisher. For larger database implementations large in size, large number of concurrent users, large number of transactions, I usually place the distributor on its own server and the subscriber on its own server both separate from the publisher.  I’ll say it again, always separate the distributor and subscriber from the publisher as that usually avoids additional overhead on the publisher. Your original goal was to take load off the publisher with this architectural solution and there is a reason you are dividing and conquering, not adding then dividing and conquering in my opinion. You should test your scenario thoroughly before deciding on the final replication architecture and see what works best for you.

Another transactional replication recommendation I have is that after your publisher, distributor, and subscriber are all setup and you begin to add the tables as articles you would like to replicate I would ensure that you DO NOT use separate distribution agents per publication if you set up your transactional replication as per table publications The reason is parent child relationships. For example, a parent table is heavily transactional and the child table is lightly transactional, the child records can be delivered to the subscriber database before the parent records are delivered and reports that need to join the two for a complete data set will not be able to make the join and lead to missing results.

I feel like I could write a lot more on this topic with code examples etc. but am trying to keep it architectural in nature.  Also, there are always exceptions to the above but generally speaking in my experience this is what worked for my clients.  I hope this helps you divide and conquer your transactional database reporting.