Wednesday, March 25, 2009

Entity Framework Patterns: Identity Map

ADO.Net Entity Framework is a different way of looking at persistence than most of us are used to. It wants us to add objects/entities to our data context instead of saving them, and it doesn’t write those objects to the database until we call SaveChanges(). We don’t directly save specific entities; instead, EF tracks the entities we’ve loaded and then saves changes to the database for any entities that it thinks have changed. My first reaction when I realized how different these concepts were from my standard way of saving data was that I hated it (this actually happened with LINQ to SQL, which I still don’t care for due to the way it handles sprocs). But the promise of rapid application development and more maintainable code kept me coming back.

I started reading up on architectures using ORMs (mostly in the Java world) and discovered that most of the things I initially didn’t like about Entity Framework and LINQ to SQL are actually accepted design patterns from the ORM world, developed by people much smarter than me who have been working for years to solve the impedance mismatch problem. So I thought it might be helpful to talk about some of these patterns and how Entity Framework handles them. The first one we’ll look at is Identity Map.

Identity Map Definition

In his book Patterns of Enterprise Application Architecture, Martin Fowler defines Identity Map with the following two sentences:

Ensures that each object gets loaded only once by keeping every loaded object in a map. Looks up objects using the map when referring to them.

So what does this mean?  It’s probably better to demonstrate than to explain, so let’s look at the characteristics of Identity Map through some code examples.

There Can Be Only One

Let’s start by looking at the other way of doing things. This is the non-Identity Map example. If we have an app that uses a simple persistence layer that runs a database query and returns a DataTable, we might see code like the following:

DataTable personData1 = BAL.Person.GetPersonByEmail("bill@gates.com");
DataTable personData2 = BAL.Person.GetPersonByEmail("bill@gates.com");

if (personData1 != personData2)
{
    Console.WriteLine("We have 2 different objects");
}

In this example, personData1 and personData2 contain separate copies of the data for the person Bill Gates. If we change the data in personData2, it has no effect on personData1. They are totally separate objects that happen to contain the same data. If we make changes to both and then save them back to the database, there is no coordination of the changes. One just overwrites the changes of the other. Our persistence framework (ADO.Net DataTables) just doesn’t know that personData1 and personData2 both contain data for the same entity. The thing to remember about this scenario is that multiple separate objects containing data for the same entity lead to concurrency problems when it’s time to save data.

Now let’s look at the Identity Map way of doing things. Below, we have some ADO.Net Entity Framework code where we create two different object queries that both get data for the same person, and then we use those queries to load three Person variables.

EFEntities context = new EFEntities();

// The same query (by email), executed twice
var query1 = from p in context.PersonSet
             where p.email == "bill@gates.com"
             select p;
Person person1 = query1.FirstOrDefault();
Person person2 = query1.FirstOrDefault();

// A different query (by name) that returns the same person
var query2 = from p in context.PersonSet
             where p.name == "Bill Gates"
             select p;
Person person3 = query2.FirstOrDefault();

if (person1 == person2 && person1 == person3)
{
    Console.WriteLine("Identity Map gives us 3 refs to a single object");
}

person1.name = "The Billster";
Console.WriteLine(person3.name); // writes The Billster

When I run the code above, all 3 variables are in fact equal. Plus, when I change the name property on person1, I see that same change on person3. What’s going on here? They’re all references to a single object that is managed by the ObjectContext. Entity Framework does some magic behind the scenes: regardless of how many times or how many different ways we load an entity, the framework ensures that only one entity object is created, and the multiple entities that we load are really just multiple references to that one object. That means we can have 10 entity variables in our code and, if they represent the same entity, they will all be references to the same object. The result is that at save time there are no conflicting copies of the entity within our context to overwrite each other. All changes get saved. So how does this work?

Every entity type has a key that uniquely identifies each entity of that type. If we look at one of our Person entities in the debugger, we notice that it has a property that Entity Framework created for us named EntityKey. EntityKey contains information such as the key values of our entity (for our Person entity the key field is PersonGuid) and which entity set our entity belongs to; basically, all the information Entity Framework needs to uniquely identify and manage our Person entity.

The EntityKey property is used by the ObjectContext (or just context) that Entity Framework generates for us. In our example the context class is EFEntities. The context class does a number of things, and one of them is maintaining an Identity Map. Think of the map as a cache that contains one and only one instance of each object, identified by its EntityKey. In fact, you will probably never hear the term Identity Map used. Most .Net developers just call it the object cache, or even just the cache. So, in our example, when we get person1 from our context, it runs the query, creates an instance of Person (which the context knows is uniquely identified by the field PersonGuid), stores that object in the cache, and gives us back a reference to it. When we get person2 from the context, the context does run the query again and pulls data from our database, but then it sees that it already has a Person entity with the same EntityKey in the cache, so it throws out the data and returns a reference to the entity that’s already in the cache. The same thing happens for person3.
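To make the mechanics concrete, here is a minimal sketch of an identity map in plain C#, with no Entity Framework involved. The CachedPerson and PersonIdentityMap types and the Materialize method are hypothetical names made up for this illustration; the real ObjectContext is far more involved, so treat this as a toy model of the idea rather than how EF is actually implemented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical entity for illustration; not generated by Entity Framework.
public class CachedPerson
{
    public Guid PersonGuid { get; set; }
    public string Name { get; set; }
    public string Email { get; set; }
}

// A toy identity map: at most one CachedPerson instance per key.
public class PersonIdentityMap
{
    private readonly Dictionary<Guid, CachedPerson> _cache =
        new Dictionary<Guid, CachedPerson>();

    // Called for every row that a "query" brings back from the database.
    public CachedPerson Materialize(Guid key, string name, string email)
    {
        CachedPerson cached;
        if (_cache.TryGetValue(key, out cached))
        {
            // Key is already tracked: discard the incoming row
            // and hand back the existing instance.
            return cached;
        }

        var person = new CachedPerson { PersonGuid = key, Name = name, Email = email };
        _cache.Add(key, person);
        return person;
    }
}

public static class IdentityMapDemo
{
    public static void Main()
    {
        var map = new PersonIdentityMap();
        Guid key = Guid.NewGuid();

        // Two "query results" for the same key.
        CachedPerson person1 = map.Materialize(key, "Bill Gates", "bill@gates.com");
        CachedPerson person2 = map.Materialize(key, "Bill Gates", "bill@gates.com");

        Console.WriteLine(ReferenceEquals(person1, person2)); // True

        person1.Name = "The Billster";
        Console.WriteLine(person2.Name); // The Billster
    }
}
```

The important design point is that the lookup happens after the row comes back from the query, which is why the database still gets hit even when the object is already in the map.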

Quiz: What Happens To Cached Entities When the Database Changes?

So here’s a question. Suppose we run the code sample above that loads person1, person2, and person3 from our context, but this time we use a breakpoint to pause execution right after we load person1. Then we manually update the database by changing the phone_home field on Bill Gates’ record to “(999) 999-9999” and continue executing the rest of our code. What value will we see for phone_home when we look at person1, person2, and person3? Will it be the original value, or the new value? Remember that all 3 variables are really just 3 references to the same entity object in the cache, and our first db hit when we got person1 did pull the original phone_home value, but then the queries for person2 and person3 also hit the database and pulled data. How does Entity Framework handle that? The answer is shown in the debugger watch window below. It throws the new data out, so all three still show the original value.

[Image: debugger watch window showing phone_home for person1, person2, and person3]

This can lead to some really unexpected behavior if you don’t know to look for it, especially if you have a long-running context that’s persisted and used over and over for multiple requests. It is very important to think about this when you’re deciding when to create a context, how long to keep it running, and what you want to happen when data on the backend is changed. There is a way to modify this behavior for individual queries by setting the ObjectQuery.MergeOption property (the default is MergeOption.AppendOnly). But we still need to remember and plan for this default behavior.
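The difference between the default behavior and an overwriting merge can be sketched in plain C#. This is a simplified model, not Entity Framework code: MergeMode, Contact, and ContactCache are hypothetical names, MergeMode only mimics two of the values of EF’s MergeOption enum (AppendOnly and OverwriteChanges), and the original phone number is a made-up value.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-ins for two of EF's MergeOption values.
public enum MergeMode { AppendOnly, OverwriteChanges }

// Hypothetical entity for illustration.
public class Contact
{
    public int Id { get; set; }
    public string PhoneHome { get; set; }
}

public class ContactCache
{
    private readonly Dictionary<int, Contact> _cache = new Dictionary<int, Contact>();

    // Merge one freshly queried row into the cache.
    public Contact Merge(int id, string phoneHome, MergeMode mode)
    {
        Contact cached;
        if (!_cache.TryGetValue(id, out cached))
        {
            // First time we see this key: track a new instance.
            cached = new Contact { Id = id, PhoneHome = phoneHome };
            _cache.Add(id, cached);
            return cached;
        }

        if (mode == MergeMode.OverwriteChanges)
        {
            // Refresh the tracked instance with the values from the database.
            cached.PhoneHome = phoneHome;
        }
        // AppendOnly: the incoming values are thrown away.
        return cached;
    }
}

public static class MergeDemo
{
    public static void Main()
    {
        var cache = new ContactCache();

        // Initial load (made-up original value).
        Contact first = cache.Merge(1, "(555) 555-5555", MergeMode.AppendOnly);

        // The database row changes, then we query again with the default mode.
        Contact second = cache.Merge(1, "(999) 999-9999", MergeMode.AppendOnly);
        Console.WriteLine(second.PhoneHome); // (555) 555-5555

        // Querying with the overwriting mode refreshes the shared instance.
        cache.Merge(1, "(999) 999-9999", MergeMode.OverwriteChanges);
        Console.WriteLine(first.PhoneHome); // (999) 999-9999
    }
}
```

In Entity Framework itself, the corresponding step would be setting MergeOption on the ObjectQuery before executing it. Note that an overwriting merge also replaces any unsaved local edits on the tracked entity, so it is worth choosing deliberately per query.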

If There’s a Cache, Why Am I Hitting The Database? 

Remember the second part of Martin Fowler’s definition where he said that the Identity Map looks up objects using the map when referring to them?  The natural question that comes to mind is, if I’m loading an object that already exists in my cache, and Entity Framework is just going to return a reference to that cached object and throw away any changes it gets from the database query, can’t I just get the object directly from my cache and skip the database query altogether? That could really reduce database load.

Unfortunately the answer is kind of, but not really. In Entity Framework v1, you can get an entity directly from the cache without hitting the database, but only if you use a special method to get the entity by its EntityKey. Having to use the EntityKey is a big limitation, since most of the time you want to look up data by some other field. For example, in a login situation I need to get a person entity by email or username. I don’t have the PersonGuid. I’m hoping that we get more options for loading entities from the cache in v2, but for now, if you do have the key field, this is how you do it:

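Guid billsGuid = new Guid("0F3087DB-6A83-4BAE-A1C8-B1BD0CE230C0");

// Qualified entity set name, key field name, key value
EntityKey key = new EntityKey("EFEntities.PersonSet", "PersonGuid", billsGuid);

// Checks the cache first; only queries the database if the entity isn't there
Person bill = (Person)context.GetObjectByKey(key);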

There are a couple of things I want to point out. First, when we create the key, the first parameter we have to give is the name of the entity set that we’re pulling from, and this name must be qualified with the entity container name, which by default is the name of our ObjectContext class (EFEntities). Second, you’ll notice that GetObjectByKey() returns an Object, so we did have to cast the return value to Person.
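Why a key lookup can skip the database while a lookup by email cannot is easier to see in a small model. The sketch below is plain C# with hypothetical names (Member, MemberContext, GetByKey, GetByEmail, DatabaseHits); an in-memory dictionary stands in for the database, and nothing here is Entity Framework’s actual implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical entity for illustration.
public class Member
{
    public Guid MemberGuid { get; set; }
    public string Email { get; set; }
}

public class MemberContext
{
    private readonly Dictionary<Guid, Member> _database; // stand-in for the real database
    private readonly Dictionary<Guid, Member> _cache = new Dictionary<Guid, Member>();

    public int DatabaseHits { get; private set; }

    public MemberContext(Dictionary<Guid, Member> database)
    {
        _database = database;
    }

    // Key lookup: the cache is indexed by key, so it can answer directly.
    public Member GetByKey(Guid key)
    {
        Member cached;
        if (_cache.TryGetValue(key, out cached))
        {
            return cached;
        }
        DatabaseHits++;
        return Track(_database[key]);
    }

    // Lookup by another field: the "query" always runs against the database,
    // and only afterwards is the result resolved through the cache.
    public Member GetByEmail(string email)
    {
        DatabaseHits++;
        foreach (Member row in _database.Values)
        {
            if (row.Email == email)
            {
                return Track(row);
            }
        }
        return null;
    }

    // Return the tracked instance for this row's key, creating it if needed.
    private Member Track(Member row)
    {
        Member cached;
        if (!_cache.TryGetValue(row.MemberGuid, out cached))
        {
            cached = new Member { MemberGuid = row.MemberGuid, Email = row.Email };
            _cache.Add(cached.MemberGuid, cached);
        }
        return cached;
    }
}

public static class KeyLookupDemo
{
    public static void Main()
    {
        Guid billsGuid = new Guid("0F3087DB-6A83-4BAE-A1C8-B1BD0CE230C0");
        var database = new Dictionary<Guid, Member>
        {
            { billsGuid, new Member { MemberGuid = billsGuid, Email = "bill@gates.com" } }
        };
        var context = new MemberContext(database);

        Member byEmail = context.GetByEmail("bill@gates.com");
        Console.WriteLine(context.DatabaseHits); // 1

        Member byKey = context.GetByKey(billsGuid);
        Console.WriteLine(context.DatabaseHits); // 1

        context.GetByEmail("bill@gates.com");
        Console.WriteLine(context.DatabaseHits); // 2

        Console.WriteLine(ReferenceEquals(byEmail, byKey)); // True
    }
}
```

The cache is indexed only by key, which is why it can answer a key lookup on its own but has no way to know which cached entity matches an arbitrary email without asking the database.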

Conclusion

So that’s one pattern down.  Hopefully discussing some of these differences in approaching persistence helps ease your transition to using Entity Framework a bit.  Next time we’ll cover another key pattern, Unit of Work.


13 comments:

  1. A good point is raised, thanks for sharing it

  2. Me new to EF... Have couple of questions:

    >>>That means that we can have 10 entity objects in our code and if they represent the
    >>same entity, they will all be references to the same object. The result is that at save
    >>>time we have no concurrency issues. All changes get saved. So how does this work?

    No concurrency issues? Does that mean the entity object returned is thread safe?

  3. Hi Vyas, good point. Now that I look at it, "no concurrency issues" was a poor choice of phrase. I didn't even consider whether the objects are thread safe. BTW, I would never even consider accessing the same context from multiple threads. But really I don't know enough to even comment on real multi-thread concurrency issues with EF.

  4. I totally agree. EF is not the way to go. I personally use straight ADO.Net and SQL. I use this tool called Orasis Mapping Studio 2009. I found it at http://www.orasissoftware.com. With the 30 day trial I build my data tier at no time. No third party dependencies no reflection. I mapped my queries to my own data types and I also let the IDE create some for me. Lean and fast code that I can fully manage. I build it all visually.

  5. I read somewhere "Friends don't let friends use Entity Framework!". That is absolutely true.

    I am hearing a lot about Orasis. I tried their stuff and have used it to map some of our Objects to some queries. It has a unique approach to it and definitely has some potential. I would recommend anyone still in the market for an OR/M replacement to atleast try it.

    Here is their download link: http://www.orasissoftware.com/download.aspx

  6. Even now I'm struggling with whether I want to use Entity Framework. I did finally decide to learn it and figure out some best practices. The main factors for me are 1) this is Microsoft's main ORM, it's in Visual Studio, at some point I am going to run into code that uses it, 2) As many problems as I have with it, I have to admit that it does make dev time shorter. I can see that even while learning it. So I don't think EF is the best solution available (I think nHibernate's better at the moment). But, I do think that EF is the best ORM to learn because it's the one I'll most likely need to support my clients. People are going to use it and Microsoft will eventually get it right, just like they did with ADO.Net. That of course assumes that they don't decide to toss the ORM approach altogether and put all their weight behind DSLs and OSLO. -rudy

  7. I found this article when I hit this problem. I have two applications using the same database, each with one global ObjectContext for the whole application (because I have implemented repositories with basic CRUD and don't even touch the ObjectContext). As you can guess, I ran into this issue very quickly.

    Now the question is how to force EF to replace cached data with data freshly pulled from the DB without discarding the ObjectContext.

    If it's not possible, I'll go back to LinqToSql :-(

    Regards
    Mariusz

  8. I have found the solution.
    The way to go is to set the MergeOption property on the query (ObjectQuery) to MergeOption.OverwriteChanges, so that all records pulled from the DB overwrite those in the EF cache.

    Regards
    Mariusz

    http://msdn.microsoft.com/en-us/library/system.data.objects.mergeoption.aspx#Y23

  9. My husband and I are building a house, and we have been talking to the contractor about the hydraulic hose fittings for the house. We were hoping to start building a lot sooner than we did, but that obviously didn't happen. Because the weather has started to turn, our contractor is suggesting that we wait until after winter to being construction. He mentioned that the foundation might not hold up over the winter months, and we could risk a cracked foundation for our home. If that happens, a crew would have to come rip up the cracked foundation and have to put a new one in. Which is a delay as well. Is it better to wait, or can we proceed with the construction and risk a cracked foundation?

  10. Have you check the hydraulic hose fittings? Its always important to check those.

  11. I have a question. Say I go to web page one, which performs a query using EF, and the results are returned in an entity type. Now I go to another web page, web page two, and that page also performs a query using EF, and those results are returned in the same entity type. The query on page two is pretty much identical to page one's query, except that additional results could be returned because the timespan is longer.

    Now suppose that between the time the query on page one completed and the time the query on page two begins, some of the results stored in the entity that contains page one's query results have changed in the database.
    When page two's query is completed, will those changes appear in the entity along with the additional results, or will the entity contain page one's query results along with the additional results? From the article it sounds like the latter will be the result.

    Please feel free to contact me at sun_Water_snow@hotmail.com with any comments or answers to the question.

  12. The statement "Entity Framework is just going to return a reference to that cached object and throw away any changes it gets from the database query, " is incorrect in my scenario.

    I am displaying the result of a query based on EF 4.3. I am using a Telerik grid to display the results of a query using EF. When the exact same query is repeated with information in the database being changed between queries, the entity used for the query has the changed data. According to this article, the changed data in the database should not be shown. The second query's results should have been thrown out.
    Can you explain why what you quoted is incorrect in my case?

    Replies
    1. I meant the phrase
      "EntityKey in the cache so it throws out the data and returns a reference to the entity that’s already in cache"
      My example above returns entity with changed data.
