Wednesday, October 29, 2014

CLR Types in SQL Server - Investigation Only

Today, I'm finally digging into a topic I've wanted to look at for a while:  using CLR types in SQL Server.  My interest comes mostly from seeing very complicated stored procedures implementing business logic that would be simple in C#.  It may well be that the teams I've worked with simply didn't know the 'proper' way to write those procedures and therefore couldn't achieve optimal performance and/or maintainability / readability.

So, I'm a developer who is very comfortable in C# and moderately comfortable with SQL, but mainly from an analytics standpoint.  Denormalizing tables and building cubes is one thing; creating a highly performant transactional system is definitely another.  So any example SQL code that I use for comparison is coming from a software developer's mind, not a SQL Server administrator's mind.

Here's the scenario:

As a developer primarily focused on C#, you are tasked with refactoring a rules engine built in SQL Server: stored procedures filled with business logic, plus tables that provide the rules those procedures use.

My reaction: 

I'm going to mess up this SQL, but I know I can do it in C# with unit tests!

However, being the practical guy that I am, I first turned to the community for its opinion.  Overall, it seems the community favors keeping business logic in an application / business-logic middle tier.  I am in that group as well; however, I'll try to consider the alternative and its merits.

Having stored procedures in SQL Server keeps the processing of the logic close to the data.  There is no client application here; we are effectively ETL-ing data from source to the data warehouse, so a separate tier would have to be added, whereas with procs that tier is unnecessary.  And since it's ETL, there is a lot of scanning over batches of records, and data access from T-SQL is going to be faster than other approaches.

CLR routines cover the first two conditions: the logic stays close to the data and no extra tier is required.  They do introduce an assembly to care for, however.  And while T-SQL is going to be faster for data access, processing the rules may be way faster in C# using the CLR.  Typically, we process nearly continuously, so the amount of data to process at once is small.  If we have to reprocess all the historic data, we'll see whether the read-performance advantage of T-SQL outweighs the processing-performance advantage of C# on the CLR.

Bottom line is there are 3 options:

1.  Improve my SQL skills to be able to write complex business rules in Stored Procedures.
2.  Write the rules in C# and register the assembly with SQL Server.  The assembly could be marked SAFE since it simply ETLs from source to DW.
3.  Create an application / business rules layer using C# that processes the data continuously.

#2 seems the most realistic option for me, so we'll investigate that now!  :)

From the Docs

As with most Microsoft things that I dive into, I like to start with a deep dive into the info available on MSDN.

Performance

To my very happy surprise, right on the first page of the docs:
  • Potential for improved performance and scalability. In many situations, the .NET Framework language compilation and execution models deliver improved performance over Transact-SQL.
Now to figure out what those situations are...  As I continued into the section about Performance of CLR Integration:

CLR functions benefit from a quicker invocation path than that of Transact-SQL user-defined functions. Additionally, managed code has a decisive performance advantage over Transact-SQL in terms of procedural code, computation, and string manipulation. CLR functions that are computing-intensive and that do not perform data access are better written in managed code. Transact-SQL functions do, however, perform data access more efficiently than CLR integration.

Ugh, right back where I started.  The function is computing-intensive, but it requires "lots" of data access.  Then again, the typical scenario is processing just the delta since the last run finished; is that really a lot?  Even reprocessing all the historic data... is that a lot?  And are the rules complex enough to notice any performance benefit?
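
To make the distinction concrete, here's a minimal sketch of the kind of routine the docs say belongs in managed code: pure computation and string manipulation with no data access.  The class, method, and rule pattern are mine for illustration; they aren't taken from the actual rules engine.

using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;
using System.Text.RegularExpressions;

public partial class RuleFunctions
{
    // Pure computation, no data access: per the docs, this is where
    // managed code tends to beat T-SQL.  The rule logic is illustrative only.
    [SqlFunction(IsDeterministic = true, DataAccess = DataAccessKind.None)]
    public static SqlBoolean MatchesRulePattern(SqlString input, SqlString pattern)
    {
        if (input.IsNull || pattern.IsNull)
            return SqlBoolean.Null;

        // String manipulation that would be painful in T-SQL.
        return Regex.IsMatch(input.Value, pattern.Value);
    }
}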

Then, we always have to keep in mind there are other costs aside from performance.  Is the cost of the stored procedure approach (readability / maintainability / team skills) higher than the cost of the CLR approach, and does that difference outweigh any potential performance benefits?

So far, more questions than answers for sure.  At least for now I know it has the potential to be a good solution.

Security

There are a few security mechanisms available for restricting code run by an assembly.

This assembly will only ETL data from a staging table containing the delta from source since the last processing run completed, so we can use SAFE.  SAFE is the most restrictive setting and cannot access resources outside of SQL Server.  You can read about the other options on MSDN using the above link.

When the assembly is registered with SQL Server, it verifies that the assembly does not contain code that would not be allowed to run when called.  This prevents run-time errors caused by touching external resources.  Luckily, it can detect code that will never be called and will still allow the assembly to be loaded, even if that unreachable code would otherwise cause the load to fail.

SQL Server Types in CLR

The SqlTypes namespace provides the same semantics and precision found in SQL Server.  This is especially important for NULLability in comparisons, numeric precision, and streaming LOB types.  The type mapping between the CLR and SQL Server can be found here.
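
The NULL semantics are the part that trips up plain C# types, so here's a quick sketch (a standalone demo, not part of any CLR routine) of how SqlTypes behaves:

using System;
using System.Data.SqlTypes;

class SqlTypesNullDemo
{
    static void Main()
    {
        // SqlTypes carry SQL Server NULL semantics, unlike plain int/decimal.
        SqlInt32 a = new SqlInt32(5);
        SqlInt32 b = SqlInt32.Null;

        SqlInt32 sum = a + b;            // NULL propagates; sum.IsNull is true
        SqlBoolean equal = (a == b);     // comparing with NULL yields NULL, not false

        Console.WriteLine(sum.IsNull);   // True
        Console.WriteLine(equal.IsNull); // True
    }
}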

I'm interested in the streaming LOB support and streaming table-valued functions (STVFs).  I did not realize such a thing existed, and I'm wondering if this could help performance as well.

Versioning and User Defined Persisted Types

One of the features I was unaware of is that you can create CLR data objects and save them directly into a column of that type.  There is a SqlUserDefinedType attribute that describes how to persist the object.

Because UDTs are accessed by the system as a whole, their use for complex data types may negatively impact performance. Complex data is generally best modeled using traditional rows and tables. UDTs in SQL Server are well suited to the following:
  • Date, time, currency, and extended numeric types
  • Geospatial applications
  • Encoded or encrypted data
Ok, so what does "accessed by the system as a whole" mean...  Further, what counts as complex: anything with a child object?  Does JSON-serialized data count as encoded?

User-defined types must implement the INullable interface from SqlTypes, as SQL Server is 'null-aware'.  There are a ton of other restrictions as well.
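
To get a feel for the shape of a UDT, here's a bare-bones sketch; the type name and format are my own choices, and a real UDT has more to worry about (byte ordering for indexing, size limits, and so on):

using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

// Format.Native keeps serialization simple but limits fields to simple
// value types; INullable, a static Null property, Parse, and ToString
// are all part of the required contract.
[Serializable]
[SqlUserDefinedType(Format.Native)]
public struct SimplePoint : INullable
{
    private bool isNull;
    private int x;
    private int y;

    public bool IsNull { get { return isNull; } }

    public static SimplePoint Null
    {
        get { SimplePoint p = new SimplePoint(); p.isNull = true; return p; }
    }

    // SQL Server calls Parse when converting a string literal to the UDT.
    public static SimplePoint Parse(SqlString s)
    {
        if (s.IsNull) return Null;
        string[] parts = s.Value.Split(',');
        SimplePoint p = new SimplePoint();
        p.x = int.Parse(parts[0]);
        p.y = int.Parse(parts[1]);
        return p;
    }

    public override string ToString()
    {
        return isNull ? "NULL" : x + "," + y;
    }
}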

UDTs can be read from a data reader and cast directly to the object's type.  The client code must, of course, have a reference to the assembly containing the UDT.  The object can also be read as raw bytes, in which case no reference is required.

A SqlParameter can have its type set to SqlDbType.Udt, and a UDT instance can then be passed as input to a procedure.
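
Here's a rough sketch of the client side, assuming the SimplePoint type from the sketch above and a made-up dbo.Points table; the connection is assumed to be open already:

using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;

class UdtClientDemo
{
    static void ReadAndWrite(SqlConnection conn)
    {
        using (SqlCommand select = new SqlCommand("SELECT Location FROM dbo.Points", conn))
        using (SqlDataReader reader = select.ExecuteReader())
        {
            while (reader.Read())
            {
                // Cast straight to the CLR type when the UDT assembly is referenced...
                SimplePoint p = (SimplePoint)reader.GetValue(0);

                // ...or read raw bytes with no assembly reference required:
                // byte[] raw = reader.GetSqlBytes(0).Value;
            }
        }

        using (SqlCommand insert = new SqlCommand(
            "INSERT INTO dbo.Points (Location) VALUES (@p)", conn))
        {
            SqlParameter param = insert.Parameters.Add("@p", SqlDbType.Udt);
            param.UdtTypeName = "dbo.SimplePoint";   // the type name registered in SQL Server
            param.Value = SimplePoint.Parse(new SqlString("1,2"));
            insert.ExecuteNonQuery();
        }
    }
}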

Overall, my first impression of UDTs is that they are extremely complicated for anything beyond simple data types.

CLR Methods

Custom CLR Attributes

Methods in a CLR routine can be decorated with the SqlFunction attribute and can specify a DataAccess parameter to help SQL Server understand the method's intent (e.g., that it reads data from a table).
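
A sketch of what that looks like; the table and query are made up, but the context connection string and DataAccessKind.Read are the real mechanics:

using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class RuleFunctions
{
    // DataAccessKind.Read tells SQL Server the function reads user tables,
    // which is required before opening the in-process context connection.
    [SqlFunction(DataAccess = DataAccessKind.Read)]
    public static SqlInt32 ActiveRuleCount()
    {
        using (SqlConnection conn = new SqlConnection("context connection=true"))
        {
            conn.Open();
            using (SqlCommand cmd = new SqlCommand(
                "SELECT COUNT(*) FROM dbo.Rules WHERE IsActive = 1", conn))
            {
                return new SqlInt32((int)cmd.ExecuteScalar());
            }
        }
    }
}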

CLR vs T-SQL Table Valued Functions

T-SQL TVFs return a table that can have unique indexes and constraints, since it is an actual table.  CLR TVFs do not do this; instead they expose a streaming interface via IEnumerable.  As soon as the first record is available, the calling code can start consuming results.
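
The streaming pattern looks roughly like this; the numbers-and-squares example is deliberately trivial, just to show the IEnumerable / fill-row shape:

using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class RuleFunctions
{
    // A CLR TVF returns IEnumerable; SQL Server pulls rows as they are
    // yielded, so the caller can start consuming before the loop finishes.
    [SqlFunction(
        FillRowMethodName = "FillSquareRow",
        TableDefinition = "Number INT, Squared INT")]
    public static IEnumerable Squares(SqlInt32 upTo)
    {
        int limit = upTo.IsNull ? 0 : upTo.Value;
        for (int i = 1; i <= limit; i++)
        {
            yield return i;   // each yielded object becomes one row
        }
    }

    // Called once per yielded object to split it into output columns.
    public static void FillSquareRow(object row, out SqlInt32 number, out SqlInt32 squared)
    {
        int i = (int)row;
        number = new SqlInt32(i);
        squared = new SqlInt32(i * i);
    }
}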

A user-defined table type cannot be passed as a table-valued parameter to, or be returned from, a managed stored procedure or function executing in the SQL Server process.

Can a UDT in CLR be used?

One more note: OUT parameters are passed by reference to the CLR method.


CLR Stored Procedures

Returning Tabular Results and Messages

Returning tabular results and messages to the client is done through the SqlPipe object, which is obtained by using the Pipe property of the SqlContext class. The SqlPipe object has a Send method. By calling the Send method, you can transmit data through the pipe to the calling application.

Returning Tabular Results

To send the results of a query directly to the client, use one of the overloads of the ExecuteAndSend method on the SqlPipe object. This is the most efficient way to return results to the client, since the data is transferred to the network buffers without being copied into managed memory.
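
Pulling those pieces together, here's a sketch of a CLR stored procedure; the table and column names are made up, but SqlContext.Pipe, ExecuteAndSend, Send, and the out parameter (OUT parameters being passed by reference, as noted earlier) are the real moving parts:

using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class RuleProcedures
{
    [SqlProcedure]
    public static void GetActiveRules(out SqlInt32 ruleCount)
    {
        using (SqlConnection conn = new SqlConnection("context connection=true"))
        {
            conn.Open();

            // Most efficient path: results stream to the client's network
            // buffers without being copied into managed memory.
            using (SqlCommand cmd = new SqlCommand(
                "SELECT RuleId, RuleName FROM dbo.Rules WHERE IsActive = 1", conn))
            {
                SqlContext.Pipe.ExecuteAndSend(cmd);
            }

            using (SqlCommand count = new SqlCommand(
                "SELECT COUNT(*) FROM dbo.Rules WHERE IsActive = 1", conn))
            {
                ruleCount = new SqlInt32((int)count.ExecuteScalar());
            }
        }

        // Send also accepts plain text for informational messages.
        SqlContext.Pipe.Send("Rule processing complete.");
    }
}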

Implementation

Unfortunately, I was unable to refactor the SQL rules to compare them against CLR routines, as the procedures were beyond my ability to comprehend.  I do hope to return to this topic soon, but at least my initial investigation and notes are here for when that day comes! :)
