Wednesday, October 29, 2014

CLR Types in SQL Server - Investigation Only

Today, I'm finally digging into a topic I've wanted to explore for a while:  using CLR types in SQL Server.  My interest gets piqued every time I see a very complicated stored procedure implementing business logic that would be simple in C#.  It may well be that the teams I've worked with simply did not know the 'proper' way of writing these procedures and therefore could not achieve optimal performance and/or maintainability / readability.

So, I'm a developer that is very comfortable in C# and moderately comfortable with SQL, but mainly from an analytics standpoint.  Denormalizing tables and building cubes is one thing; creating a highly performant transactional system is definitely another.  So any example SQL code that I use for comparison is coming from a software developer's mind, not a SQL Server administrator's mind.

Here's the scenario:

As a developer primarily focused on C#, you are tasked with refactoring a rules engine built in SQL Server:  stored procedures filled with business logic, plus tables supplying the rules those procedures consume.

My reaction: 

I'm going to mess up this SQL, but I know I can do it in C# with unit tests!

However, being the practical guy that I am, I first turned to the community for their opinion.  Overall, it seems the community favors business logic in an application / business-logic middle tier.  I am in that group myself; however, I'll try to consider the alternative and its merits.

Having stored procedures in SQL Server keeps the processing of the logic close to the data.  There is no client application here; we are effectively ETL-ing data from the source to the data warehouse, so an application tier would have to be created, whereas with procs that tier is unnecessary.  And since it's ETL, there is a lot of scanning over batches of records, and data access from T-SQL is going to be faster than other approaches.

CLR routines cover the first two conditions:  the logic stays close to the data and no extra tier is required.  They would introduce an assembly to care for, however.  And while T-SQL is going to be faster for data access, processing the rules may be far faster in C# using the CLR.  Typically, we process nearly continuously so the amount of data to process at a time is small.  If we have to reprocess all the historic data, we'll see whether the read-performance advantage of T-SQL outweighs the processing-performance advantage of C# using the CLR.

Bottom line is there are 3 options:

1.  Improve my SQL skills to be able to write complex business rules in Stored Procedures.
2.  Write the rules in C# and register the assembly with SQL Server.  The assembly could be marked SAFE since it simply ETLs from source to DW.
3.  Create an application / business rules layer using C# that processes the data continuously.

#2 seems the most realistic option for me so we'll investigate that now!  :)

From the Docs

As with most Microsoft things, I like to start with a deep dive into the info available on MSDN.

Performance

To my very happy surprise, on the first page of the docs:
  • Potential for improved performance and scalability. In many situations, the .NET Framework language compilation and execution models deliver improved performance over Transact-SQL.
Now to figure out what those situations are...  As I continued into the section about Performance of CLR Integration:

CLR functions benefit from a quicker invocation path than that of Transact-SQL user-defined functions. Additionally, managed code has a decisive performance advantage over Transact-SQL in terms of procedural code, computation, and string manipulation. CLR functions that are computing-intensive and that do not perform data access are better written in managed code. Transact-SQL functions do, however, perform data access more efficiently than CLR integration.

Ugh, right back where I started.  The function is computing-intensive, but requires "lots" of data access.  Then again, the typical scenario is just the delta since the last processing finished; is that really a lot?  Even reprocessing all the historic data... is it a lot?  And are the rules complex enough to notice any performance benefit?

Then, we always have to keep in mind there are costs aside from performance.  Is the cost of the stored-procedure approach (readability / maintainability / team skills) higher than the cost of the CLR approach, and does that cost outweigh any potential performance benefits?

So far, more questions than answers for sure.  At least for now I know it has the potential to be a good solution.

Security

There are a few security mechanisms available for restricting code run by an assembly.

This assembly will only ETL data from a staging table holding the delta from the source since the last processing run, so we can use SAFE.  SAFE is the most restrictive setting; a SAFE assembly cannot access resources outside of SQL Server.  You can read about the other options on MSDN using the above link.

When registering the assembly, SQL Server verifies that it does not contain code that would be disallowed at the chosen permission level.  This prevents run-time errors caused by reaching out to external resources.  Luckily, it can detect code that will never be called, and it will still load the assembly even if that unreachable code would otherwise violate the permission set.

SQL Server Types in CLR

The SqlTypes namespace provides the same semantics and precision found in SQL Server.  This is especially important for NULLability in comparisons, numeric precision, and streaming LOB types.  The type mapping between the CLR and SQL Server can be found on MSDN.
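To make that concrete, here's a minimal sketch (the names and values are mine) of the three-valued NULL semantics SqlTypes gives you:

using System;
using System.Data.SqlTypes;

class SqlTypesDemo
{
    static void Main()
    {
        SqlInt32 a = new SqlInt32(10);
        SqlInt32 b = SqlInt32.Null;      // a database NULL, not a CLR null reference

        SqlBoolean gt = a > b;           // comparing against NULL yields SqlBoolean.Null
        Console.WriteLine(gt.IsNull);    // True -- three-valued logic, just like T-SQL

        SqlInt32 sum = a + b;            // arithmetic with NULL propagates NULL
        Console.WriteLine(sum.IsNull);   // True
    }
}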

I'm interested in the streaming LOB and Streaming Table Valued Functions (STVFs).  I did not realize such a thing existed and I'm wondering if this could help performance as well.

Versioning and User Defined Persisted Types

One of the features I was unaware of is that you can create CLR data objects and save them directly into a column of that type.  There is a SqlUserDefinedType attribute that describes how the object is persisted.

Because UDTs are accessed by the system as a whole, their use for complex data types may negatively impact performance. Complex data is generally best modeled using traditional rows and tables. UDTs in SQL Server are well suited to the following:
  • Date, time, currency, and extended numeric types
  • Geospatial applications
  • Encoded or encrypted data
Ok, so what does "accessed by the system as a whole" mean?  Further, what counts as complex; anything with a child object?  Does JSON-serialized data count as encoded?

User-defined types must implement the INullable interface from SqlTypes, as SQL Server is 'null-aware'.  There are a ton of other restrictions as well.
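Here's a bare-bones sketch of what a UDT looks like.  The type name and fields are mine, but the contract (INullable, a static Null property, Parse, and ToString) is what SQL Server requires:

using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

[Serializable]
[SqlUserDefinedType(Format.Native)]
public struct Money : INullable
{
    private bool isNull;
    private long cents;

    public bool IsNull { get { return isNull; } }

    public static Money Null
    {
        get { Money m = new Money(); m.isNull = true; return m; }
    }

    // SQL Server requires Parse and ToString for string conversions.
    public static Money Parse(SqlString s)
    {
        if (s.IsNull) return Null;
        Money m = new Money();
        m.cents = (long)(decimal.Parse(s.Value) * 100);
        return m;
    }

    public override string ToString()
    {
        return isNull ? "NULL" : (cents / 100m).ToString();
    }
}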

UDTs can be read from a data reader and cast to the object's type directly.  The client code must, of course, have a reference to the assembly containing the UDT.  The object can also be read as bytes, in which case no reference is required.

A SqlParameter can have its type set to SqlDbType.Udt, and a UDT instance can then be passed as input to a procedure.
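Putting those last two paragraphs together, client code might look something like this (a sketch assuming the Money type above is registered in the database and the client references its assembly; the table and type names are made up):

using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;

public static class UdtClientDemo
{
    public static void ReadAndPassUdt(SqlConnection conn)
    {
        using (var cmd = new SqlCommand("SELECT Price FROM dbo.Products", conn))
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                Money price = (Money)reader[0];   // cast straight to the CLR type
            }
        }

        // Passing a UDT instance as an input parameter:
        var param = new SqlParameter("@price", SqlDbType.Udt)
        {
            UdtTypeName = "dbo.Money",            // the type name registered in SQL Server
            Value = Money.Parse(new SqlString("19.99"))
        };
    }
}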

Overall, my first impression of UDTs is that they are extremely complicated for anything beyond simple data types.

CLR Methods

Custom CLR Attributes

Methods in a CLR routine can be decorated with the SqlFunction attribute and specify a DataAccess parameter to help SQL Server understand the method's intent (eg, that it reads data from a table).
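For example, something like this (the table is made up) tells SQL Server the function reads data, which is what permits the context connection inside it:

using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class Functions
{
    [SqlFunction(DataAccess = DataAccessKind.Read)]
    public static SqlInt32 ActiveRuleCount()
    {
        // The context connection runs in-process against the calling session.
        using (SqlConnection conn = new SqlConnection("context connection=true"))
        {
            conn.Open();
            using (SqlCommand cmd = new SqlCommand(
                "SELECT COUNT(*) FROM dbo.Rules WHERE IsActive = 1", conn))
            {
                return new SqlInt32((int)cmd.ExecuteScalar());
            }
        }
    }
}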

CLR vs T-SQL Table Valued Functions

T-SQL TVFs return a table that can have unique indexes and constraints, since it is an actual table.  CLR TVFs do not; instead they expose a streaming interface via IEnumerable.  As soon as the first record is available, the calling code can start consuming.
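A minimal streaming TVF sketch looks like this:  the method yields rows one at a time, and a fill-row method maps each yielded object onto the declared table shape.

using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class TvfFunctions
{
    [SqlFunction(FillRowMethodName = "FillRow", TableDefinition = "Value INT")]
    public static IEnumerable UpTo(SqlInt32 max)
    {
        if (max.IsNull) yield break;      // honor NULL input
        for (int i = 1; i <= max.Value; i++)
            yield return i;               // streamed to the caller, not buffered
    }

    public static void FillRow(object row, out SqlInt32 value)
    {
        value = new SqlInt32((int)row);
    }
}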

A user-defined table type cannot be passed as a table-valued parameter to, or be returned from, a managed stored procedure or function executing in the SQL Server process.

Can a UDT in CLR be used?

Pass by reference when using OUT parameters.


CLR Stored Procedures

Returning Tabular Results and Messages

Returning tabular results and messages to the client is done through the SqlPipe object, which is obtained by using the Pipe property of the SqlContext class. The SqlPipe object has a Send method. By calling the Send method, you can transmit data through the pipe to the calling application.

Returning Tabular Results

To send the results of a query directly to the client, use one of the overloads of the ExecuteAndSend method on the SqlPipe object. This is the most efficient way to return results to the client, since the data is transferred to the network buffers without being copied into managed memory.
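A sketch of both mechanisms together (the query is made up):

using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

public class Procedures
{
    [SqlProcedure]
    public static void SendRules()
    {
        SqlContext.Pipe.Send("Fetching rules...");   // a message to the client

        using (SqlConnection conn = new SqlConnection("context connection=true"))
        {
            conn.Open();
            using (SqlCommand cmd = new SqlCommand("SELECT * FROM dbo.Rules", conn))
            {
                SqlContext.Pipe.ExecuteAndSend(cmd); // results stream straight to the client
            }
        }
    }
}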

Implementation

Unfortunately, I was unable to refactor the SQL Rules to compare against the CLR routines as the procedures were beyond my ability to comprehend.  I do hope to return to this topic soon, but at least my initial investigation and notes are here for when that day comes! :)

Saturday, October 11, 2014

Intro to Tasks

Today I'd like to look at a very popular topic in the .NET community:  Tasks.  Most people like to focus on the async and await keywords, but I find myself typically using Tasks directly for anything that is not trivial.
I struggled way more than I should have when I first investigated the Task<T> return type in conjunction with the async and await keywords.  I did not understand what the async keyword really meant, or that the await keyword applies to the Task type and not the async method signature.  So here's what it took for me to understand tasks.

The async keyword is applied to a method that, in the method's body, uses the await keyword.  Here's an example:

public async Task<Client> GetCustomerByName(string name){
    if(String.IsNullOrWhiteSpace(name)){
        throw new ArgumentException();
    }
 
    Client client = await new ClientRepository().Fetch(new FindClientByNameCriteria(name));
    return client.IsActive ? client : null;
}

There are two observations worth discussing here.  First, notice that the client object is unwrapped from the Fetch method (which returns Task<Client>) because of the await keyword; await unwraps the result for you.  Second, the return type of the method is Task<Client>, but in the method body we return a Client object; the compiler wraps the return value in a Task for us.  This is true of every return statement in an async method, even one that appears before the first await, so an early exit also returns a plain Client.  Let's look at that now.

public async Task<Client> GetCustomerByName(string name){
    if(String.IsNullOrWhiteSpace(name)){
        throw new ArgumentException();
    }

    // you would never do this, but for the sake of an example....
    if(StringComparer.OrdinalIgnoreCase.Equals(name, AdminUserName)){
        return AdminUser; // still a plain Client; the compiler wraps it in a Task
    }
 
    Client client = await new ClientRepository().Fetch(new FindClientByNameCriteria(name));
    return client.IsActive ? client : null;
}

The idea behind async Task methods is that all the code up to the first await executes synchronously, typically to ensure that the call has a chance to succeed, or that any pre-processing completes, before the rest of the method executes asynchronously.
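A quick sketch to see that in action (a console app, so there's no SynchronizationContext to worry about):

using System;
using System.Threading.Tasks;

class SyncPrefixDemo
{
    static async Task DemoAsync()
    {
        Console.WriteLine("A: runs synchronously on the caller's thread");
        await Task.Delay(100);   // control returns to the caller here
        Console.WriteLine("B: runs later, as a continuation");
    }

    static void Main()
    {
        Task t = DemoAsync();    // "A" has already printed by this point
        t.Wait();                // fine in a console app; see the deadlock discussion below
        Console.WriteLine("C: the task has completed");
    }
}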

Let's rewrite this without using async or await.

public Task<Client> GetCustomerByName(string name){
    if(String.IsNullOrWhiteSpace(name)){
        throw new ArgumentException();
    }
 
    return new ClientRepository().Fetch(new FindClientByNameCriteria(name))
                                 .ContinueWith(x => x.Result.IsActive ? x.Result : null);
}

First, notice that we do not use the async keyword on this method because we are not using the await keyword in the method body.

Abstractly, you can think of the await keyword as a more terse way of specifying the continuation.  Everything after an await effectively goes into a ContinueWith; if there are multiple awaits, there are multiple ContinueWiths.
Therefore, whenever you use the await keyword, the method will return a Task, just as ContinueWith returns a Task.  Since everything in the method following the await is placed into a continuation, there is no way to return anything else.

It may feel strange at first to return the object directly from an async method, but consider the continuation approach:  inside the continuation, we just return whatever object we want, and it's wrapped in a Task.  The async method is the same; since your code is effectively wrapped in a continuation, you return the object you want and it gets wrapped in a Task.

OK, so now that we've covered the basics, let's look at a few special cases or best practices when using Tasks.

Methods with void return type

You may be wondering how to convert a method with a void return type into an async method, since the return type must be a Task.  Luckily, the non-generic Task comes to the rescue.

Microsoft recommends that async methods return Task instead of void, with the exception of event handlers (ie, callbacks).  I assume the reasoning is that the code 'calling back' does not need to know when the callback has completed execution, and callbacks returning a value would be a very strange paradigm.

Async event handlers also come with very non-intuitive exception handling.  Typically, an exception thrown in a Task-returning method is captured and placed on the returned Task.  In an async void event handler there is no Task, so the exception is re-thrown on the original SynchronizationContext, and you cannot easily catch it.

Aside from event handlers, you should convert void methods to return the non-generic Task type.  This allows the caller to be signaled when the method has finished executing, to investigate the task for errors, and to specify continuations on success and failure.
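For example, converting a hypothetical void logging method:

using System;
using System.Threading.Tasks;

public class AuditLog
{
    // Before: public void Write(string message)
    // After: the non-generic Task signals completion without carrying a value.
    public async Task WriteAsync(string message)
    {
        await Task.Delay(10); // stand-in for real asynchronous I/O
        Console.WriteLine("Logged: " + message);
    }
}

// Callers can now observe completion and failures:
// await log.WriteAsync("rules processed");
// log.WriteAsync("rules processed").ContinueWith(t => Console.WriteLine(t.IsFaulted));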

Async All the Way

Let me start by clearly stating that this is by far the most problem-laden aspect of asynchronous programming:  context.  We don't often have to worry much about context in synchronous programming; everything is in the same context and only one thing executes at a time.  When we move into asynchronous programming, everything comes down to context.  Even more annoying, code that works just fine in a console application can fail with deadlocks in GUI and web applications.

The easiest way to avoid this problem is to be asynchronous everywhere.  Deadlock issues tend to arise when using the Wait() method or Result property of a task.  It can be the case that the thread that calls Wait() is also the thread that has to run the continuation of the task that Wait() was called on.  The thread blocks waiting on itself, and you are deadlocked!  It cannot complete the task's work because it is blocked waiting for that work to be completed.  Going async all the way avoids calls to Result and Wait.

Note:  Using Result in a continuation is perfectly acceptable as the task has finished and the result is guaranteed to be available.
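Here's a sketch of the deadlock described above.  It assumes a GUI SynchronizationContext, such as a WinForms event handler (the same code would be fine in a console app):

using System;
using System.Threading.Tasks;

public class DeadlockDemoForm // assumes a WinForms-style SynchronizationContext
{
    private async Task<string> FetchAsync()
    {
        await Task.Delay(1000);   // the continuation is posted back to the UI thread
        return "done";
    }

    private void Button_Click(object sender, EventArgs e)
    {
        // The UI thread blocks here waiting on the task, but the task's continuation
        // needs the UI thread in order to run: deadlock.
        string result = FetchAsync().Result;
    }

    // The async-all-the-way fix (event handlers are the one acceptable async void):
    // private async void Button_Click(object sender, EventArgs e)
    // {
    //     string result = await FetchAsync();
    // }
}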

You'll find that as you are converting your code to be asynchronous, it is naturally 'contagious' and code calling asynchronous methods tends to want to be asynchronous also (typically via continuation).

One of the nicer aspects of this approach is that the continuation lives near the code it continues.  Often, the continuation applies only to that block of code and is not reused anywhere.  It is frustrating to write a separate method for the callback and wire it up to execute after the called code finishes; the two no longer feel related, and there is no way to tell that the method only applies to that one continuation.

Consider the difference:

public Task<CustomerDto[]> FetchCustomers(Criteria<Customer> criteria) {
    return new CustomerRepository().Fetch(criteria)
                                   .ContinueWith(x => x.Result.Select(y => new CustomerDto(y))
                                                               .ToArray());
}

vs

private static CustomerDto[] ConvertCustomers(Task<Customer[]> customers){
     return customers.Result.Select(y => new CustomerDto(y)).ToArray();
}

public Task<CustomerDto[]> FetchCustomers(Criteria<Customer> criteria){
    return new CustomerRepository().Fetch(criteria)
                                   .ContinueWith(ConvertCustomers);
}

vs

private static CustomerDto[] ConvertCustomers(IEnumerable<Customer> customers){
     return customers.Select(y => new CustomerDto(y)).ToArray();
}

public async Task<CustomerDto[]> FetchCustomers(Criteria<Customer> criteria){
    var customers = await new CustomerRepository().Fetch(criteria);
    return ConvertCustomers(customers);
}

I tend to use the first approach as it contains all the logic for that method.

Awaiter Configuration

This is another annoying implementation detail.  Tasks have a method, ConfigureAwait(bool), that specifies whether the continuation must execute on the context captured from the calling code.  It drives me crazy that the default is not false, especially when it's best practice to set this to false.  I would think that code that requires the same context should have to say so, since that is the minority of cases.  This is another reason that I favor continuations directly over using the async keyword.
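In library-style code that doesn't touch the UI after the await, the call looks like this (a sketch; the type and file are made up):

using System.IO;
using System.Threading.Tasks;

public class ReportReader
{
    public async Task<string> ReadReportAsync(string path)
    {
        using (var reader = new StreamReader(path))
        {
            // Nothing after this await needs the caller's context, so release it.
            return await reader.ReadToEndAsync().ConfigureAwait(false);
        }
    }
}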

If you are developing WinForms applications, you may want the context that called the code to also process the continuation.  This is true especially when manipulating any GUI elements or bound properties.  In web applications, using the HttpContext is a good reason to require the same context.  One way around that is to capture any context variables locally before awaiting.

Context-free code is typically easier to manage and maintain.  A strategy for converting context-sensitive code into more manageable asynchronous code is to put the core logic of the method into a context-free async method and call that from the context-sensitive method, which is the only part that still requires its context.
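A sketch of that strategy (the repository type, its FetchAsync method, and the customer shape are all made up for illustration):

using System;
using System.Threading.Tasks;
using System.Windows.Forms;

public class CustomerForm : Form
{
    private readonly Label summaryLabel = new Label();
    private readonly CustomerRepository repository = new CustomerRepository();

    private async void RefreshButton_Click(object sender, EventArgs e)
    {
        var summary = await LoadSummaryAsync(42);   // resumes on the UI context
        summaryLabel.Text = summary;                // safe: we're back on the UI thread
    }

    private async Task<string> LoadSummaryAsync(int id)
    {
        // The core logic is context-free; the await releases the caller's context.
        var customer = await repository.FetchAsync(id).ConfigureAwait(false);
        return customer.Name + " (" + customer.Orders.Count + " orders)";
    }
}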


I think that about sums up what I have to say for today about Tasks.  I know I'll be revisiting this topic in the future with some better examples, but for now... Enjoy!

Friday, October 3, 2014

Variance

I want to discuss the importance of variance in the context of generic type arguments.

Starting in C# 4.0, we have two new keywords when declaring generic types:  in and out.

in is used to indicate that a type parameter is contravariant, or that the type is only accepted as "input".  An example of a contravariant interface in the framework is IEqualityComparer<in T>.  The equality comparer is able to accept two instances of T as input to its Equals method.  There is no way to get a T back out of the equality comparer.

out is used to indicate that a type parameter is covariant, or that the type is only returned as "output".  An example of a covariant interface in the framework is IEnumerable<out T>.  The enumerable is only able to return instances of T; there is no way to pass a T to any method of the enumerable.

Knowing that a type parameter is covariant or contravariant is extremely useful for abstraction.  If a parameter is neither covariant nor contravariant, it is known as invariant.

When we know that the type is only ever accepted as input, the consumer can accept any more derived variant of the type, knowing it can reliably interact with it as though it were the less derived type.

Consider the case of an equality comparer and a class structure as follows:

public class LivingThing
{
    public LivingThing(string kingdom){ this.Kingdom = kingdom; }
    public string Kingdom {get; private set;}
}

public class Mammal : LivingThing
{
    public Mammal(string family) : base("Animalia"){ this.Family = family; }
    public string Family {get;set;}
}

public class Dog : Mammal
{
    public Dog(string breed) : base("Canidae"){ this.Breed = breed; }
    public string Breed {get;set;}
}

public class Cat : Mammal
{
    public Cat(string breed) : base("Felidae"){ this.Breed = breed; }
    public string Breed {get;set;}
}

public class LivingThingComparer : IEqualityComparer<LivingThing>
{
    public bool Equals(LivingThing x, LivingThing y){
        if(Object.ReferenceEquals(x, y)) return true;
        if(null == x || null == y) return false;
        return StringComparer.OrdinalIgnoreCase.Equals(x.Kingdom, y.Kingdom);
    }

    public int GetHashCode(LivingThing obj){
        return StringComparer.OrdinalIgnoreCase.GetHashCode(obj.Kingdom);
    }
}

var someAnimal = new LivingThing("Animalia");
var someFungus = new LivingThing("Fungus");

new LivingThingComparer().Equals(someAnimal, someFungus); // returns false

Comparing two living things is based on the kingdom to which the thing belongs.  Fungus and animals are not equal.  However, this is not very useful; often we want to compare more derived things than some animal and some fungus.

var dog = new Dog("Pug");

new LivingThingComparer().Equals(someAnimal, dog); // returns true

Both someAnimal and the dog belong to the Animalia kingdom and therefore are equal as far as LivingThings go.

var cat = new Cat("Persian");

new LivingThingComparer().Equals(cat, dog); // returns true

While it may be strange to consider a Persian Cat and a Pug Dog the same, in terms of living things at the Kingdom level they are, and we can count on the Kingdom property being available on any derived LivingThing.

public class MammalComparer : IEqualityComparer<Mammal>
{
    public bool Equals(Mammal x, Mammal y){
        if(Object.ReferenceEquals(x, y)) return true;
        if(null == x || null == y) return false;
        return StringComparer.OrdinalIgnoreCase.Equals(x.Family, y.Family);// no need to check kingdom
    }

    public int GetHashCode(Mammal obj){
        return StringComparer.OrdinalIgnoreCase.GetHashCode(obj.Family);
    }
}

new MammalComparer().Equals(cat, dog);// returns false
new MammalComparer().Equals(someAnimal, dog); // does not compile

someAnimal is a LivingThing, which is less derived than Mammal, so the comparer has no idea how to compare it.  The Family property by which the comparer decides equality does not exist on LivingThing.
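And here is the payoff of IEqualityComparer<in T> being contravariant:  a comparer written for a less derived type can stand in wherever a comparer of a more derived type is required.

IEqualityComparer<Mammal> byKingdom = new LivingThingComparer(); // compiles: contravariance
IEqualityComparer<Dog> byFamily = new MammalComparer();          // compiles: contravariance

// e.g. a HashSet<Dog> can deduplicate dogs using the Mammal comparer:
var dogs = new HashSet<Dog>(new MammalComparer()){ new Dog("Pug"), new Dog("Bulldog") };
Console.WriteLine(dogs.Count); // 1 -- both share the Family "Canidae"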

OK.  Now that we understand contravariance, let's look at covariance using the enumerable case and same class structure as above.

IEnumerable<Mammal> collectionOfMammals = new List<Cat>(){ new Cat("Persian"), new Cat("Siamese") };
collectionOfMammals = new Dog[]{ new Dog("Pug"), new Dog("Bulldog") };

The same collection of mammals can be set to a collection of a derived type of mammal.  It is safe to allow this assignment because the client will only be able to access the objects in the collection as Mammals, and both Cat and Dog are Mammals.  We can even mix the bag.

collectionOfMammals = new List<Cat>(){ new Cat("Persian"), new Cat("Siamese") }
                          .Cast<Mammal>()
                          .Concat(new Dog[]{ new Dog("Pug"), new Dog("Bulldog") });

Because the client only has access to the Mammal abstraction of the object, it is safe to mix the bag.  Every property the client could use would be available to both types.  The danger would be when trying to add to the underlying collection when the actual type is different.

foreach(var mammal in collectionOfMammals){
    Console.WriteLine(mammal.Breed);// doesn't compile; while both cat and dog have this property, it is not visible behind the mammal abstraction
    Console.WriteLine(mammal.Family); // all mammals will have this property so safe to use
}

collectionOfMammals.Add(new Dog("Poodle")); // doesn't compile; IEnumerable<T> has no Add -- the List methods that would make it invariant are hidden behind the enumerable abstraction

Assume that the enumerable interface had the ability to add items and was still marked as covariant.

IEnumerable<Mammal> mammals = new List<Dog>() { new Dog("Pug") };
mammals.Add(new Cat("Persian"));

At first it may seem harmless:  we've declared the variable as an enumerable of mammals and we're adding a mammal.  However, the actual type is List<Dog>, and you cannot add a Cat to that collection type.

OK.  Now we know what co- and contra- variant type parameters are, but so what?  Well, let me tell you... Before these keywords were available, life was a hellish nightmare when using immutable generics that had nested generic parameters.

Consider a simple example that often drives me crazy:

IDictionary<string, IEnumerable<string>> dict = new Dictionary<string, IList<string>>();// doesn't compile
// In IDictionary<TKey, TValue>, TValue is not covariant due to the Add(TKey, TValue) method, so you cannot use a further derived type as TValue (IList<string> : IEnumerable<string>)

This one at least makes sense:  I could use the Add method to add the wrong type of collection for the actual type (Dictionary<string, IList<string>>).  Sometimes I'd rather the framework allow this and throw a runtime exception if the client calls Add with an invalid collection type.
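Incidentally, arrays (a pre-generics design) behave exactly that way, and they show the runtime failure the compiler is protecting us from here:

object[] objects = new string[] { "a", "b" };   // array covariance: compiles
objects[0] = 42;                                // throws ArrayTypeMismatchException at run time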

Task<T> is probably the most frustrating example of not using covariance.  If Task<T> implemented an ITask<out T>, life would be much better.  There are so many cases where I'd like to do the following:

Task<IEnumerable<string>> StringTask(){
    return Task.FromResult(new List<string>(){ "A", "B", "C" });
} // doesn't compile, Task<List<String>> cannot be assigned to Task<IEnumerable<String>>

Obviously I wouldn't want to just use FromResult, but imagine I'm awaiting something that returns a list and I want to expose it as IEnumerable.  I have to explicitly cast the list to IEnumerable<string> before returning.   It gets waaay worse with nested generic types, where you have to re-create the entire nested structure and cast to the appropriate abstraction.
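The workarounds I end up with look like this (FetchListAsync is a hypothetical method returning Task<List<string>>):

Task<IEnumerable<string>> StringTask(){
    // pin the type parameter explicitly at creation
    return Task.FromResult<IEnumerable<string>>(new List<string>(){ "A", "B", "C" });
}

async Task<IEnumerable<string>> StringTaskAsync(){
    // or let await unwrap, then return through the desired abstraction
    List<string> list = await FetchListAsync();
    return list; // implicitly converts to IEnumerable<string>
}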

Well, that's about all I have to say about variance for now.  I'm sure I'll be touching this subject again in a future post.  Let me know what you think!

Starting to blog

Today I have resolved to begin writing my blog on various (and often random) programming topics that I happen to enjoy.

I hope that by blogging, I improve my ability to communicate complex programming concepts and share some insight into my views on software development in general.

Along the way, I'm sure having these notes will be invaluable, and if others in the community benefit from this blog, all the better!

I look forward to any constructive criticism of my thoughts on our awesome profession and the discussions that will follow.  I look forward to sharing my passion for development!