How to combine the Unit Of Work pattern and Repository pattern in an easy and intuitive way

How to structure my data layer in a project is something I have put a lot of thought into. It would seem that I'm a little OCD when it comes to code cleanliness and organization. Every time I write data layer code I end up pondering the same questions and wasting far too much time trying to come up with the absolute perfect solution. In order to make this article clear let's start off with a list of goals.

  • The data layer must be abstracted. In other words, the rest of the application should be completely unaware of the technology used in the data layer. If we are using Entity Framework (EF) then only the data layer should know that. If we wanted to swap EF out for a completely different technology, we should only have to modify the data layer.
  • It has to be testable. Unit tests should be easy to write against this data layer. That means we need simple interfaces so we can mock them up in our test cases.
  • It needs the ability to perform transaction-style work. This is also known as the Unit Of Work pattern.

These goals seem simple enough, so let's take a quick look at what we get out of the box with Entity Framework. Believe it or not, EF is already a brilliant combination of the Unit Of Work pattern and the Repository pattern. EF creates an object for us known as the data context. All we have to do is instantiate that object and then we have access to the various repositories (entity collections) attached to it. After interacting with the repositories we can call SaveChanges() on the data context to commit the current set of changes as a single transaction. The EF data context is really quite impressive and the team at Microsoft that put the EF API together really knew what they were doing. Why then do we need to do anything at all? As awesome as the EF API is, it is not abstracted and it's not testable. Every part of the application that uses the data context would be writing LINQ expressions against the context.

When you write LINQ expressions in your business logic you are making your application depend on a data layer that has a query provider. A query provider is basically a framework that understands how to take LINQ expressions against an IQueryable<T> collection and translate them into the data store's native query language. At the time of writing that pretty much means the data layer has to use EF or LINQ to SQL. The only other alternative would be to write your own query provider, and although I've never written one before, I don't imagine writing a query provider every time you change your data store would be a simple task. The other problem with using the raw data context is that it's not testable. The data context exposes entity collections against which you can write any number of LINQ queries, and the data that gets returned from those queries could be almost anything. There is no way to reliably test data retrieval in our application.

The easy solution to both these problems is to use the repository pattern and a dependency injection framework. Let's say that I have a database with a table called tblBooks. First, let's consume data from our data store with plain EF.

DataContext dataContext = new DataContext();

// Retrieve all math books.
var mathBooks = dataContext.Books.Where(x => x.Genre == "Math");

// Add new book to data store.
var newBook = new Book
{
    Title = "The Big Goodbye"
};
dataContext.Books.Add(newBook);
dataContext.SaveChanges();

Because the Books collection is of type IQueryable<Book>, EF knows how to translate the LINQ expression into SQL. Now let's wrap that collection with a repository so we can use dependency injection and write unit tests.

public interface IBookRepository
{
    IList<Book> GetBooks(string genre);
    void AddBook(Book newBook);
    void SaveChanges();
}

public class EFBookRepository : IBookRepository
{
    private DataContext _dataContext;

    public EFBookRepository(DataContext dataContext)
    {
        _dataContext = dataContext;
    }

    public IList<Book> GetBooks(string genre)
    {
        return _dataContext.Books.Where(x => x.Genre == genre).ToList();
    }

    public void AddBook(Book newBook)
    {
        _dataContext.Books.Add(newBook);
    }

    public void SaveChanges()
    {
        _dataContext.SaveChanges();
    }
}

Now that we have a repository we can create Ninject mappings and have the repository injected anywhere we need it. We can also easily unit test any areas that use this repository because we can just mock it. This is pretty nice and it's what I've done for a while now, but there are still some problems with it. One problem is the awkward way of saving changes. The data context has one SaveChanges() method while our repository pattern puts one on every single repository. What if I inject two or more repositories into my application and use them both before needing to call save? Which save do I then call? My instinct would be to call save on all repositories that were used, but this highlights another problem with the repository pattern and EF. If you deal with more than one entity in EF, especially via navigation properties, then both entities need to have come from the same data context object. If you instantiate two contexts and try to make entities from different contexts play nice together you will just want to pull your hair out. This means that the same instance of the data context needs to be injected into all repositories that are used. How is that a problem for save changes?

Because each repository now shares the same data context, all of the repositories' save methods end up calling the same save method on that context. So calling save on every repository is redundant; we need only call save on one of them. Which one then? It doesn't matter; you just pick one at random and call save on it. This is a nasty situation that I knew I had to fix. Not only is it just plain awkward, but it doesn't adhere to separation of concerns. If I ever did switch my data store technology to something other than Entity Framework then I might genuinely have to call the save method on every repository I used, which would mean refactoring business logic to ensure all saves were being called. The fix for this blunder is the Unit Of Work pattern.
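
Before moving on, here is a rough sketch of the awkward spot we're in, assuming a hypothetical IAuthorRepository (with an AddAuthor method) and a hypothetical Author entity alongside the book repository:

public class LibraryService
{
    private readonly IBookRepository _bookRepository;
    private readonly IAuthorRepository _authorRepository; // hypothetical second repository

    public LibraryService(IBookRepository bookRepository, IAuthorRepository authorRepository)
    {
        _bookRepository = bookRepository;
        _authorRepository = authorRepository;
    }

    public void AddBookWithAuthor(Book book, Author author)
    {
        _bookRepository.AddBook(book);
        _authorRepository.AddAuthor(author);

        // Both repositories expose SaveChanges(), but they share one data context,
        // so calling save on either one commits everything. Which one do you pick?
        _bookRepository.SaveChanges();
        // _authorRepository.SaveChanges(); // redundant
    }
}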

The Unit Of Work pattern defines one object to be the entry point for accessing data. You manipulate the data through the use of this object and then call save when finished. It's nice and simple and it solves the problems of having multiple save methods on each repository. It even resolves an annoying issue with the repository pattern where in some scenarios I would need upwards of 10 different repositories all at the same time. Since we use Ninject for dependency injection I would have to write my constructors to accept every repository that I needed. I'd end up writing 10 different fields in my class, accepting 10 different repositories in the constructor, and assigning those 10 different repositories to the fields. With the Unit Of Work pattern all the repositories are attached to one object which I've decided to call a "Data Bucket".

public interface IDataBucket
{
    IBookRepository BookRepository { get; }
    void SaveChanges();
}

public class DataBucket : IDataBucket
{
    private DataContext _dataContext;
    private IKernel _kernel;
    
    public DataBucket(DataContext dataContext, IKernel kernel)
    {
        _dataContext = dataContext;
        _kernel = kernel;
    }

    public IBookRepository BookRepository { get { return _kernel.Get<IBookRepository>(); } }

    public void SaveChanges()
    {
        _dataContext.SaveChanges();
    }
}

You've likely noted by now what I was hinting at earlier: we have pretty much rebuilt the Entity Framework data context. We have a single object with repositories attached to it and a single SaveChanges() method. Ironic as that seems, our data bucket is extremely robust and solves several problems. Both the data bucket and all attached repositories sit behind interfaces so we can mock them when writing tests for our application. We no longer have to worry that testing a controller in our MVC app will accidentally save test data into the actual database. And the business logic of our application is now completely unaware of the technology that stores the data; it could be EF, LINQ to SQL, a NoSQL store, and so on.
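
As a quick illustration of the testability win, here is a minimal sketch of a unit test, assuming the Moq and xUnit packages and a hypothetical BookService that takes an IDataBucket (a sketch of it appears further down):

[Fact]
public void GetMathBooks_returns_books_from_the_repository()
{
    // Canned repository data instead of a real database.
    var bookRepository = new Mock<IBookRepository>();
    bookRepository.Setup(r => r.GetBooks("Math"))
                  .Returns(new List<Book> { new Book { Title = "The Big Goodbye", Genre = "Math" } });

    var dataBucket = new Mock<IDataBucket>();
    dataBucket.Setup(b => b.BookRepository).Returns(bookRepository.Object);

    var service = new BookService(dataBucket.Object); // hypothetical consumer
    var books = service.GetMathBooks();

    Assert.Single(books);

    // Nothing was persisted; no real database was ever touched.
    dataBucket.Verify(b => b.SaveChanges(), Times.Never());
}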

The rest of the application now only ever asks dependency injection for the data bucket. Notice how the repositories are resolved inside the properties on the data bucket rather than injected through the constructor. If I really wanted to I could have accepted every repository in the constructor and let Ninject pass them in, but that felt redundant. The bigger reason to resolve them in the properties is that only the repositories I actually access on the data bucket get instantiated; if Ninject passed them all in through the constructor then every one of them would be instantiated up front. I haven't done any benchmarking, but it makes sense not to instantiate 100 objects when I only plan to use a few of them.
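
For reference, here is a rough sketch of what the Ninject bindings might look like; the module name is just an example, and InRequestScope() assumes the Ninject.Web.Common package in a web app:

public class DataModule : NinjectModule
{
    public override void Load()
    {
        // One data context per request so every repository resolved during that
        // request shares the same instance.
        Bind<DataContext>().ToSelf().InRequestScope();

        Bind<IDataBucket>().To<DataBucket>();

        // Still needed because the data bucket resolves repositories through IKernel.
        Bind<IBookRepository>().To<EFBookRepository>();
    }
}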

I used to manually new up the repositories so that I didn't have to have Ninject pass them all in through the constructor. I have since learned that you can have Ninject inject its own IKernel, the interface Ninject itself uses to resolve dependencies. Once we have an instance of IKernel we can ask Ninject to resolve a repository at the moment its property on the data bucket is accessed. Instantiating only the repositories that get accessed essentially gives us a lazy-loaded data bucket. To access the data layer it's now as simple as injecting IDataBucket into your class, using the repositories attached to it, and calling SaveChanges() on the data bucket when finished.
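
Here is a rough sketch of what that looks like from the consuming side; BookService is just a hypothetical example, and an MVC controller would look the same:

public class BookService
{
    private readonly IDataBucket _dataBucket;

    public BookService(IDataBucket dataBucket)
    {
        _dataBucket = dataBucket;
    }

    public IList<Book> GetMathBooks()
    {
        // Only the book repository gets instantiated, since it's the only one accessed.
        return _dataBucket.BookRepository.GetBooks("Math");
    }

    public void AddBook(string title, string genre)
    {
        _dataBucket.BookRepository.AddBook(new Book { Title = title, Genre = genre });
        _dataBucket.SaveChanges();
    }
}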

Naturally, if you have 100 repositories then this data bucket would end up with 100 properties. Since it's lazy-loaded that wouldn't cause any performance issues, but it could get a little hairy to navigate. The data bucket approach provides a convenient alternative if that ever becomes an annoyance. Instead of one generalized data bucket you could create multiple data buckets that each encapsulate a set of related repositories. For example, if you needed to deal with several repositories for books and authors you could create an IBooksBucket interface and only attach the repositories that deal with books. If you then needed some other set of repositories, say for movies and producers, you could create an IMoviesBucket, and so on. If the buckets were small enough then you may not even need or want to lazy load the properties. It's completely up to you; the only caveat is that you need to group your repositories in such a way that data from two different buckets never needs to interact. If an entity from one data bucket had to interact with an entity from another, you would have just recreated the original problem of not knowing which SaveChanges() method on which data bucket to call, and you'd also have to make sure both buckets share the same EF data context.
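
A rough sketch of that split might look like the following; the author, movie, and producer repositories are hypothetical:

public interface IBooksBucket
{
    IBookRepository BookRepository { get; }
    IAuthorRepository AuthorRepository { get; }
    void SaveChanges();
}

public interface IMoviesBucket
{
    IMovieRepository MovieRepository { get; }
    IProducerRepository ProducerRepository { get; }
    void SaveChanges();
}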

Honestly, for small applications I think a single data bucket makes the most sense. You don't have to worry about how to separate your repositories, and you get a single entry point for your data layer that is extremely versatile and testable. Because all the changes go through one SaveChanges() call, work that fails partway through can be rolled back instead of leaving you with corrupted data. I will be adopting this technique going forward. If I encounter any issues with this new programming style I'll be sure to post again on this topic, but for now I think I've finally found a pattern I can get behind without much doubt :D