Customer post-mortem: Issue with AD groups

December 2nd, 2016 | Troubleshooting

Today I wrapped up one of the most challenging customer support issues we’ve ever had. In the end, the solution was ridiculously simple (as solutions often are), but it took many hours and misleading clues to get there (as it often does). Here’s the story, along with the twists and turns.

Gallery Server has supported Active Directory authentication for many years, but until recently it didn’t integrate with AD groups. This integration is really nice to have in corporate environments because you don’t have to manage a separate set of roles, and in particular you don’t have to add AD accounts to roles or remove them by hand. Gallery Server recognizes your AD groups, including their members, and uses them to secure your gallery access. We couldn’t fit this feature into the initial 4.0 release but we announced on our website that it would be available by the third quarter of 2016.

In June 2016, a customer from Norway approached us and asked if they could have early access to the promised support for AD groups. This feature was so important they were willing to pay a priority fee and accept that the feature was in a pre-release state. So we gave it to them and it worked well right out of the box. They were thrilled.

Now fast forward to November. That is, a couple of weeks ago. By that point we had officially released the AD role provider and the customer upgraded their gallery to 4.2.0. But they were having a problem. They stated that after the upgrade, new users could no longer access the gallery. After successfully authenticating with their AD account, they received this message:

[Screenshot: message indicating the user is authenticated but does not have access to any albums in the gallery]

Gallery Server shows this message when a user is logged in but does not belong to any role that provides access to the gallery. The first thing I considered was whether Gallery Server was ignoring groups in AD that it had previously recognized. So I studied the changes between the pre-release code and the production code they were now using, but nothing jumped out at me. Puzzled, I asked the customer to confirm that the pre-release code had worked, and he said yes. On the Manage Roles page, I verified that the groups had the necessary album permissions.

Then I asked for more details about their domain architecture. This is when I learned they had a fairly complex organizational structure:

[Screenshot: Active Directory structure of the customer's domain]

This structure contains many groups spread across many organizational units. To ease management, they used nested groups where one or more groups at each office location were added to a few “master” groups stored in another OU:

[Screenshot: top-level OU containing the parent AD groups]

It is these master groups, not the many child groups, that are configured to be recognized by Gallery Server in the ActiveDirectoryRoleProvider section of the web.config file:

[Screenshot: AD group provider configuration in web.config]

There is nothing wrong with this architecture. In fact, it is well organized, easy to manage, and supported in Gallery Server. When Gallery Server asks Active Directory for the members of a group, AD recursively processes all member groups and returns a flattened list of users.
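To make the flattening concrete, here is a minimal Python sketch of the idea. It is an in-memory toy, not the Active Directory API and not Gallery Server's code. The group names, user names, and the `flatten_members` helper are all invented for illustration.

```python
def flatten_members(group, members_of, _seen=None):
    """Return the set of users in `group`, following nested groups recursively.

    `members_of` maps a group name to its direct members. A member that is
    itself a key in `members_of` is treated as a nested group.
    """
    seen = set() if _seen is None else _seen
    if group in seen:  # guard against circular nesting
        return set()
    seen.add(group)

    users = set()
    for member in members_of.get(group, []):
        if member in members_of:  # nested group: recurse into it
            users |= flatten_members(member, members_of, seen)
        else:  # plain user account
            users.add(member)
    return users


# Hypothetical directory: one "master" group containing per-office groups.
DIRECTORY = {
    "Gallery Master": ["Oslo Staff", "Bergen Staff", "alice"],
    "Oslo Staff": ["bob", "carol"],
    "Bergen Staff": ["dave", "Bergen Interns"],
    "Bergen Interns": ["erin"],
}

print(sorted(flatten_members("Gallery Master", DIRECTORY)))
# ['alice', 'bob', 'carol', 'dave', 'erin']
```

Only the master group would need to be configured in this model; users in the office groups are picked up through the recursion.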

The customer claimed that when a new user was added to one of these top level groups, everything worked fine. But when the user was in one of the nested groups, they would get the message shown earlier. With this information, I built a sample of this structure on my test domain to see if I could reproduce the issue. But no matter what I tried, it always worked. That is, Gallery Server would correctly recognize users contained in nested groups.

Around this time, the customer mentioned they were getting an error when they tried to expand a user on the Manage Users page:

[Screenshot: error on the Manage Users page when using the AD group provider]

In the event log was the following entry:

System.Configuration.Provider.ProviderException: Unable to query Active Directory.

Stack trace:

at GalleryServer.Web.ActiveDirectoryRoleProvider.GetRolesForUser(String userName)
at GalleryServer.Web.Controller.RoleController.GetGalleryServerRolesForUser(String userName)
at GalleryServer.Web.Controller.UserController.GetUserEntity(String userName, Int32 galleryId)
at GalleryServer.Web.Api.UsersController.Get(String userName, Int32 galleryId)

Inner exception:

System.DirectoryServices.AccountManagement.PrincipalOperationException:  While trying to retrieve the authorization groups, an error (1789) occurred.

Inner exception stack trace:

at System.DirectoryServices.AccountManagement.AuthZSet..ctor(Byte[] userSid, NetCred credentials, ContextOptions contextOptions, String flatUserAuthority, StoreCtx userStoreCtx, Object userCtxBase)
at System.DirectoryServices.AccountManagement.ADStoreCtx.GetGroupsMemberOfAZ(Principal p)
at System.DirectoryServices.AccountManagement.UserPrincipal.GetAuthorizationGroupsHelper()
at GalleryServer.Web.ActiveDirectoryRoleProvider.GetRolesForUser(String userName)

A little googling revealed that error code 1789 means ERROR_TRUSTED_RELATIONSHIP_FAILURE: “The trust relationship between this workstation and the primary domain failed.” Based on this clue, the customer removed the server from the domain and re-added it. This solved the issue with the users page, but the issue with the nested groups remained.

At this point I asked the customer if they could take a look at my test domain architecture to see if they could help me reproduce the issue. So we set a day and time and used TeamViewer to share the screen. It turned out they thought I had accurately modeled their architecture, and they couldn’t see any reason why it worked for me and not for them.

This was getting frustrating and I was starting to run out of ideas, but I wasn’t ready to give up. I asked them if they could give me access to a PC on their network where I could conduct some tests. They gave me a nice Win7 machine with TeamViewer access. I installed IIS, Visual Studio 2015 Community edition, and the source code for both the pre-release version I gave them in June and the latest 4.2.0 release. I wanted to step through the code and compare the working version with the non-working version.

But I hit a stumbling block. When running the code, I got the following exception immediately after logging on:

System.DirectoryServices.AccountManagement.NoMatchingPrincipalException:  An error occurred while enumerating the groups. The group could not be found.

This error was being thrown from deep within .NET code while enumerating the groups returned by UserPrincipal.GetAuthorizationGroups(). It was not happening on the web server, so it was a new problem I had to resolve before I could continue. Ugghhh. After a bit of googling I found that this error occurs when a SID cannot be resolved. After a bit of experimenting I determined that just one of the groups was triggering it, so I tweaked the code to silently swallow the exception.
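The general shape of that workaround is to step the enumerator by hand, so that one unresolvable entry does not abort the whole loop. The Python sketch below is a rough model of the pattern only. `GroupEnumerator` and `NoMatchingPrincipalError` are made-up stand-ins for the .NET behavior, and unlike a silent swallow this version counts what it skips.

```python
class NoMatchingPrincipalError(Exception):
    """Stand-in for .NET's NoMatchingPrincipalException (hypothetical)."""


class GroupEnumerator:
    """Toy enumerator whose step can fail for one entry and still continue.

    A `None` entry models a group whose SID cannot be resolved.
    """

    def __init__(self, entries):
        self._entries = iter(entries)

    def __iter__(self):
        return self

    def __next__(self):
        entry = next(self._entries)  # StopIteration propagates at the end
        if entry is None:
            raise NoMatchingPrincipalError("The group could not be found.")
        return entry


def resolvable_groups(enumerator):
    """Collect every group that resolves; count the ones that do not."""
    groups, skipped = [], 0
    while True:
        try:
            groups.append(next(enumerator))
        except StopIteration:
            break
        except NoMatchingPrincipalError:
            skipped += 1  # skip the bad entry and keep enumerating
    return groups, skipped


print(resolvable_groups(GroupEnumerator(["Staff", None, "Managers"])))
# (['Staff', 'Managers'], 1)
```

Keeping a count of skipped entries is a design choice of this sketch: it leaves a trace that something was unresolvable instead of hiding it completely.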

That allowed me to run the code far enough to confirm the behavior reported by the customer. Sure enough, Active Directory was not returning the expected groups from the GetAuthorizationGroups() method. It was returning some groups, but not all. This caused the RoleProvider.GetRolesForUser() method to return fewer roles than expected, which in turn caused Gallery Server to incorrectly conclude that the user should not have access to the gallery.

When I ran the pre-release code, it also neglected to return all the groups. That is, I saw the same (incorrect) behavior in both versions of the AD role provider. But wait – the customer said it worked with the pre-release code and not with the latest code. It was challenging to make sense of all of this conflicting information. I had no plausible explanation for how this could have worked with the old code.

At this point I must have spent 10-12 hours troubleshooting with the customer and didn’t have any good theories about what was happening. How could it have worked before? Why did this affect only new users? How do I get it working again?

I wrote up my findings for the customer yesterday. I described how I could not find a difference in behavior between the old code and the new code. I said I believed that factors outside of Gallery Server had led them to think the pre-release code worked. I offered to continue investigating if they could show me pre-release code that worked. And I described how Gallery Server was simply responding to the information given to it by Active Directory. To figure out why AD was returning unexpected results, I suggested bringing in a domain expert or Microsoft Support Services.

And then I said one more thing. A thing I almost didn’t write.

Please check that the AD group is a Security group, not a Distribution group. According to this thread, this can cause the behavior we are seeing.

AD groups that are distribution groups are not returned when you use the GetAuthorizationGroups() method. While that would explain the behavior we were seeing, it didn’t fit the other evidence that (1) the code used to work and (2) only new users are affected. It didn’t seem a likely cause.
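A toy model helps show how a single distribution group can hide a master group from a user. The Python sketch below is not the Active Directory algorithm. It rests on one assumption, chosen because it matches the symptom in this case: the membership walk neither reports a distribution group nor continues through it to its parents. The `Group` type, the group names, and `authorization_groups` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    is_security: bool
    member_of: list = field(default_factory=list)  # names of parent groups


def authorization_groups(direct_groups, groups):
    """Security groups reachable from a user's direct memberships.

    ASSUMPTION: a distribution group is neither returned nor traversed,
    so any parent reachable only through it is lost as well.
    """
    result = set()
    stack = list(direct_groups)
    while stack:
        name = stack.pop()
        group = groups[name]
        if not group.is_security or name in result:
            continue
        result.add(name)
        stack.extend(group.member_of)
    return result


# Hypothetical setup: two office groups nested inside one master group.
GROUPS = {
    "Gallery Master": Group("Gallery Master", is_security=True),
    "Oslo Staff": Group("Oslo Staff", True, ["Gallery Master"]),
    "Bergen Staff": Group("Bergen Staff", False, ["Gallery Master"]),
}

print(sorted(authorization_groups(["Oslo Staff"], GROUPS)))
# ['Gallery Master', 'Oslo Staff']
print(sorted(authorization_groups(["Bergen Staff"], GROUPS)))
# []
```

In this model a user whose only membership is the distribution group ends up with no roles at all, which is what the "no access to any albums" message reflects.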

But that turned out to be it. One of the AD groups was configured as a distribution group, and as soon as they changed it to a security group, everything worked as expected! The customer let me know he inadvertently gave me incorrect information. It’s not that the code used to work. It never did for users in this one particular AD group. It turned out that the users in that group never tried logging in to the gallery until the most recent upgrade, so the problem only appeared after the upgrade. It’s not clear to me why he thought it affected only new users, but it didn’t.
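If you export group data and want to spot distribution groups yourself, the `groupType` attribute carries the answer: it is a bit field in which the 0x80000000 flag (ADS_GROUP_TYPE_SECURITY_ENABLED) marks a security group. Here is a small Python sketch of that check. The helper name and the sample group names are invented; the sample values are the usual ones for global groups.

```python
SECURITY_ENABLED = 0x80000000  # ADS_GROUP_TYPE_SECURITY_ENABLED


def is_security_group(group_type: int) -> bool:
    """True if the groupType bit field has the security-enabled flag set.

    AD stores groupType as a signed 32-bit integer, so security groups
    show up as negative numbers. Python's `&` handles that correctly.
    """
    return bool(group_type & SECURITY_ENABLED)


# Hypothetical export of (group name, groupType) pairs.
exported = [
    ("Gallery Master", -2147483646),  # global security group
    ("Bergen Staff", 2),              # global distribution group
]

distribution_only = [name for name, gt in exported if not is_security_group(gt)]
print(distribution_only)
# ['Bergen Staff']
```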

Lessons learned

This support case was difficult because of incorrect information provided by the customer and my inability to reproduce the issue on my test system. Here are some of the lessons:

  • The AD role provider doesn’t recognize distribution groups. Use security groups.
  • Please be careful with your words when you ask for help. Be sure what you are saying is accurate.
  • Look for patterns in the behavior you see. Run some experiments.
  • If I am unable to quickly resolve your issue, take a pause and reevaluate the behavior you are seeing. Challenge your assumptions.
  • If possible, give me full access to your server right away. Much of the time spent on this issue went into taking the customer’s description at face value and trying to reproduce it on my system. That does not work if the information is wrong.
  • I repeat – let me log on to your network and server to see the issue for myself. Things get resolved more quickly when I can get at the source. I understand there may be security concerns but if you want to be efficient this is the way to do it. If necessary, use TeamViewer and watch me to assure yourself I’m not getting into trouble.

I’m not upset with the customer. I’ve made plenty of mistakes and wrong assumptions in my life, and more are sure to follow. But this is a good reminder of what we can strive for.

About the Author:

Founder and Lead Developer of Gallery Server