Logon Trigger Woes

This month’s TSQLTuesday is hosted by Steve Jones on the topic of triggers!

In a large enough environment, eventually the question will be asked “is this database still being used?” There are a few technologies available to help you answer this question, and by far the worst is logon triggers.

I work in a large mixed environment (SQL Server, Oracle, MySQL, YourSQL, etc…) and logon triggers have been used here for a very long time. Specifically on the Oracle servers, logon triggers are used to track who is connecting to the database from where. To a lesser extent, they’re also used to control where connections come from. So the idea of using logon triggers to identify if a SQL Server was still being used seemed very natural to the team.

Logon triggers were something I learned for a 2008 certification exam but I have never seen them put to use in a SQL Server environment.

Setting up a Logon Trigger

First, I created a database with a table to capture login information. Columns for the login, the timestamp, and the host they rode in on:

CREATE DATABASE [DBAInfo]
GO

USE [DBAInfo]
GO
CREATE TABLE [dbo].[LoginInfo](
[Login_Name] [nvarchar](256) NULL,
[Login_Time] [datetime] NULL,
[Host_Name] [nvarchar](128) NULL
) ON [PRIMARY]
GO

I wrote my logon trigger to delete any old Login\Host records before inserting the new record. The idea being that I really only care about the most recent login. I also don’t want this table to cause any growth concerns, and I don’t want to put a purge job in place. Having the DELETE in the trigger takes care of all those concerns (but adds new ones, more on that below):

CREATE TRIGGER [LastLogon]
ON ALL SERVER WITH EXECUTE AS 'sa'
FOR LOGON
AS
BEGIN
DECLARE @hostname NVARCHAR(128)

--Capture the hostname
SELECT @hostname = EVENTDATA().value('(/EVENT_INSTANCE/ClientHost)[1]','NVARCHAR(128)')

--Remove old Login\Hostname records
DELETE FROM DBAInfo.dbo.LoginInfo
WHERE [Login_Name] = ORIGINAL_LOGIN()
AND [Host_Name] = @hostname

--Insert the new record
INSERT INTO DBAInfo.dbo.LoginInfo ([Login_Name], [Login_Time], [Host_Name])
VALUES (ORIGINAL_LOGIN(), GETDATE(), @hostname)
END
GO

The trigger works as expected, recording all logins. As a DBA, I like having a table to query and sort to my liking.

Oops

But what happens when the trigger fails? My first brush with this was when I dropped the database containing the logon trigger. This causes an error in the trigger and the logon fails! No users or application accounts can log in, including me.

The error message provided is “Logon failed for login due to trigger execution” and there are two ways to fix it:
  1. Log on with the dedicated administrator connection (DAC) and disable\drop the trigger (see the sketch below).
  2. Restart SQL Server in single-user mode and disable\drop the trigger.
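
If you end up needing the DAC, something like this gets you back in (MyServerName is a placeholder; the trigger name is the one from this post):

--From a command prompt, connect through the DAC (note the admin: prefix):
--    sqlcmd -S admin:MyServerName -E
--Then disable (or drop) the offending trigger:
DISABLE TRIGGER [LastLogon] ON ALL SERVER;
--or: DROP TRIGGER [LastLogon] ON ALL SERVER;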

While I may be able to go the rest of my career without breaking this trigger, another DBA might not. Mistakes happen to all of us and having a server’s availability hinge on a single table or database strikes me as very risky.

Deadlocks

I implemented this trigger on a test system that was being monitored by Idera SQL Diagnostic Manager. Idera was creating so many connections that the logon trigger began deadlocking. Both the failed logons and the deadlocks are logged in the SQL Server Error Log:
[Screenshot: deadlock errors and failed logons from the Idera connections in the SQL Server Error Log]

Summary

My logon trigger, as written, cannot go into production. If you are forced to use a logon trigger in production, here are some thoughts:

  1. Make sure you have error handling in place. The trigger can never fail, or you will lock users out of the instance (a hardened sketch follows this list).
  2. Make sure you have remote DAC configured (and know how to use it!), for cases when the trigger does fail.
  3. Limit the transactions in the trigger to avoid deadlocking.
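
On the first point, here is a minimal sketch of one way to harden the trigger with TRY/CATCH, assuming the same DBAInfo.dbo.LoginInfo table (the trigger name is just illustrative). The DML runs through dynamic SQL so that a missing database or table surfaces as a catchable error instead of failing the logon outright, and the host name is captured directly in the trigger body because EVENTDATA() returns NULL inside nested code:

CREATE TRIGGER [LastLogon_Safe]
ON ALL SERVER WITH EXECUTE AS 'sa'
FOR LOGON
AS
BEGIN
BEGIN TRY
--EVENTDATA() must be read directly in the trigger body, so capture it here
DECLARE @hostname NVARCHAR(128)
DECLARE @login sysname
SELECT @hostname = EVENTDATA().value('(/EVENT_INSTANCE/ClientHost)[1]','NVARCHAR(128)')
SELECT @login = ORIGINAL_LOGIN()

--Dynamic SQL keeps object-resolution errors (e.g. a dropped DBAInfo database)
--at a lower level where TRY/CATCH can actually catch them
EXEC sys.sp_executesql
N'DELETE FROM DBAInfo.dbo.LoginInfo
WHERE [Login_Name] = @login AND [Host_Name] = @hostname;
INSERT INTO DBAInfo.dbo.LoginInfo ([Login_Name], [Login_Time], [Host_Name])
VALUES (@login, GETDATE(), @hostname);',
N'@login sysname, @hostname NVARCHAR(128)',
@login = @login, @hostname = @hostname
END TRY
BEGIN CATCH
--Swallow the error rather than block the logon; a doomed transaction
--can still fail the logon, so test your failure modes before trusting this
DECLARE @ignored NVARCHAR(4000)
SELECT @ignored = ERROR_MESSAGE()
END CATCH
END
GO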

If you’re like me and need to get a grip on who’s logging into your server, a better solution is to use an audit to capture logins (post coming in the future on that), a trace, or a third-party tool. There are many ways this trigger could be re-written to prevent these issues, but IMO logon triggers are too risky for critical environments.
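
For reference, here is a minimal sketch of the audit approach. The audit names and file path are hypothetical, so adjust them for your environment:

USE [master]
GO
--Server audit that writes successful logins to a file target
CREATE SERVER AUDIT [Audit_Logins]
TO FILE (FILEPATH = N'D:\Audits\', MAXSIZE = 256 MB, MAX_ROLLOVER_FILES = 10)
WITH (ON_FAILURE = CONTINUE)
GO
CREATE SERVER AUDIT SPECIFICATION [Audit_Logins_Spec]
FOR SERVER AUDIT [Audit_Logins]
ADD (SUCCESSFUL_LOGIN_GROUP)
WITH (STATE = ON)
GO
ALTER SERVER AUDIT [Audit_Logins] WITH (STATE = ON)
GO
--Read the captured events; client details land in the additional_information column
SELECT event_time, server_principal_name, additional_information
FROM sys.fn_get_audit_file(N'D:\Audits\*.sqlaudit', DEFAULT, DEFAULT)

Unlike the trigger, an audit created with ON_FAILURE = CONTINUE will not block logins if it breaks.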

SUSER_SNAME() Returning the Wrong Login

If users are logging into your SQL Server with their Active Directory accounts, you may run into an issue if their login changes. This issue is usually associated with someone changing their account name to reflect a change in their legal name. If the user executes the function SUSER_SNAME(), it may return their old login.

There is some caching that takes place on the server which holds onto the “key” associated with that account (the account’s security identifier, or SID). When a user has their account name changed, Active Directory will not update this “key”, only the “name”. Any server that has cached the old name for that account may keep returning it. When SQL Server processes the SUSER_SNAME() call, the old account name is returned.

Rebooting the server has been shown to resolve this issue but there may be a better way. If your users are obtaining access through an Active Directory group (which they should be), simply creating a login for that specific user and then dropping it will resolve the issue.

Creating the login will require the server to validate the account “name” against Active Directory, during which the cached record will be overwritten. Dropping the login will return the instance configuration to its original state, where the user obtains access via their group.
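
A minimal sketch of that fix, using a hypothetical domain account and run as a sysadmin:

--Creating the login forces SQL Server to validate the account name against Active Directory
CREATE LOGIN [MYDOMAIN\Jane.NewName] FROM WINDOWS
GO
--Drop it again so the user keeps getting access through their AD group
DROP LOGIN [MYDOMAIN\Jane.NewName]
GO
--The user can confirm the fix from a new session
SELECT SUSER_SNAME()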

Finding VMotions with Powershell and the VCenter CLI

If you work in a VMWare shop you’re probably aware that VMs can be migrated to a new host under certain conditions. While the purpose of these migrations is good, I’ve found that virtual SQL Servers usually do not like to be VMotioned. At my current employer, this is often noticed in AlwaysOn Availbility Groups.

Our team will occasionally get notified that an AlwaysOn Group has failed over. After investigating, we can’t find any meaningful cause. Then I’ll look in the VSphere client and, sure enough, the server was migrated to a new host. This happens so often that VSphere is now the first place I check for failovers. VMWare makes some Powershell capabilities available to us, so I’ve created a cmdlet to quickly return the VMWare “events” for a given VM.

Before I get to the code, you might be thinking that VMotion is not the culprit. I assure you it is. I routinely have this argument with virtualization experts and even other DBAs. “But Microsoft says it’s supported!”… I know they do, but VMs still don’t migrate without impact 100% of the time. As a DBA, I believe VMotions should be used only for host outages (planned or un-planned). Letting VCenter migrate VMs based on resource usage means you haven’t allocated the proper resources to your servers.

First, download and install the VMWare PowerCLI. You can find the installer and the documentation here: https://www.vmware.com/support/developer/PowerCLI/index.html

As the name implies, the Command Line Interface is going to drop you into an interactive session with VCenter. We’re going to design our function to perform the following:

  1. Save our current location
  2. Import the CLI module
  3. Open the CLI
  4. Use the CLI to get the events for our server
  5. Close the CLI
  6. Return to our original location

The first step is easily accomplished with a native Powershell cmdlet, and the second step is a basic module import:

Push-Location

Import-Module VMWare.VimAutomation.Core

Push-Location captures your current directory and adds it to a stack for later use.

Next, open the CLI with the following, replacing the value of $VCenterServer with the name of your VCenter server:

$VCenterServer = 'VCenterServerXYZ'
Connect-VIServer $VCenterServer

Once the CLI is opened, we need to retrieve the VM we’re after, then retrieve the events for that VM

$Server = 'ServerABC'
Get-VM -Name $Server | Get-VIEvent

The above command will dump the raw results to our console; I prefer to clean up the output as follows:

$Server = 'ServerABC'
Get-VM -Name $Server | Get-VIEvent | Select-Object CreatedTime, FullFormattedMessage

Once we have our information, we can close the CLI

Disconnect-VIServer -Server $VCenterServer

And lastly, pop our original directory location off the stack

Pop-Location

In practice, I have this code wrapped up in a cmdlet and loaded into every Powershell session via my profile. I have the VCenter server hardcoded, but the VM that I’m retrieving events for is a parameter. The whole cmdlet looks like this:

<#
.NOTES
 NAME:    Get-VMEvents
 AUTHOR:  James Livingston

.SYNOPSIS
Queries VCenter Server for the events related to a specific virtual machine.
Requires the VCenter CLI.

.EXAMPLE
Get-VMEvents -Server ServerABC
#>
function Get-VMEvents {
    [cmdletbinding()]
    param (
        [Parameter(Mandatory = $true)] [string]$Server
    )

    Push-Location

    Import-Module VMware.VimAutomation.Core

    $VCenterServer = 'VCenterServerXYZ'

    Connect-VIServer $VCenterServer | Out-Null

    Get-VM -Name $Server | Get-VIEvent | Select-Object CreatedTime, FullFormattedMessage

    Disconnect-VIServer -Server $VCenterServer -Force -Confirm:$false

    Pop-Location
}

Get-VMEvents -Server 'ServerABC'

So You Blew Up the Plan Cache…

Recently, I was asked to investigate high CPU on a production server. I began my routine, which is a combination of wait stats, performance counters, and resource-hogging statements. I had identified a CPU-intensive stored procedure that had a cardinality issue in its plan. I was about to present my findings when… POOF.

Another DBA had just executed DBCC FREEPROCCACHE and cleared the procedure cache. Immediately following this statement, the server hit 100% CPU and stayed there for 15-20 minutes. So, I started my investigation over and found the server was being flooded by a rarely seen wait type:

RESOURCE_SEMAPHORE_QUERY_COMPILE

This wait type means SQL Server is trying to create an execution plan but doesn’t have enough memory to compile it, so the request waits until more memory is available.
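
A quick way to see whether this wait is piling up is to check the wait stats DMV:

SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'RESOURCE_SEMAPHORE_QUERY_COMPILE'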

On a stable system, you probably won’t see this wait type much unless you’re starved for memory. The server I was investigating had 64GB, which seemed appropriate for the workload. So why are we getting hammered with these waits? And why is it taking so long to stabilize?

The key here was to look at the plan cache after the system stabilized. I use the SQLSkills script found on Kimberly Tripp’s blog post.
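
The full script is on that post; a simplified query in the same spirit, breaking the cache down by object type, might look like this:

SELECT objtype,
COUNT(*) AS total_plans,
SUM(CASE WHEN usecounts = 1 THEN 1 ELSE 0 END) AS single_use_plans,
AVG(usecounts * 1.0) AS avg_uses_per_plan,
SUM(CAST(size_in_bytes AS BIGINT)) / 1024 / 1024 AS cache_mb
FROM sys.dm_exec_cached_plans
GROUP BY objtype
ORDER BY total_plans DESC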

[Screenshot: plan cache breakdown showing adhoc plans dominating the cache]

Notice the high count of adhoc plans; they completely dominate the plan cache. A third of the plans are single-use, and the average is 58 uses per plan. There are far fewer procedure plans, but they see extremely high re-use, averaging 100k uses per plan.

After the plan cache was cleared, all of these plans had to compete for the same compilation memory, hence the memory waits. Having to re-compile tens of thousands of plans contributed to the high CPU, which in turn led to dreadful database performance. In this scenario, a single-use adhoc plan was consuming resources that would be better spent on a high-use procedure plan. Had this been a system with a lower number of plans (or a lower number of adhoc plans), the performance degradation may have been shorter-lived.

The lesson learned here is that blowing up the procedure cache on a system with a large number of plans will lead to pain. In our case, we could not differentiate between the initial high-CPU issue and the high CPU caused by clearing the cache. So in the end, we don’t know whether DBCC FREEPROCCACHE helped the issue.
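
If a single plan really is the problem, DBCC FREEPROCCACHE also accepts a plan handle, which lets you evict just that plan instead of nuking the entire cache. A rough sketch, using a hypothetical procedure name:

DECLARE @plan_handle VARBINARY(64)

--Find the cached plan for the suspect procedure
SELECT TOP (1) @plan_handle = cp.plan_handle
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.[text] LIKE N'%usp_MyExpensiveProc%'
AND st.[text] NOT LIKE N'%dm_exec_cached_plans%'

--Evict only that plan
IF @plan_handle IS NOT NULL
DBCC FREEPROCCACHE (@plan_handle)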

Rolling Downgrade From Enterprise to Developer

So, is it possible to go from Enterprise to Developer in an AlwaysOn cluster? TLDR: Yes.

I was working in a dev environment recently that had been setup with AlwaysOn Availability Groups using the Enterprise Edition of SQL Server 2016. This environment is completely devoted to development of our applications which means it qualifies for use of the Developer edition of SQL Server. Not all shops agree with using Developer edition for dev or test environments, and I understand the argument. However, our shop decided that in this case, we’re comfortable with it.

Now, there’s an “Edition Upgrade” option in the “Maintenance” tab of the SQL Server 2016 install. The intended use is to upgrade from Developer to Enterprise, but there’s no functionality to downgrade from Enterprise to Developer. I could have had new virtual servers built and conducted a migration, but I decided to attempt a “rolling downgrade” across this AlwaysOn cluster.

The basic process is to remove the replica from the Availability Group, un-install SQL Server Enterprise edition, re-install SQL Server Developer edition, and finally add the replica back to the Availability Group. During this rolling downgrade, SQL Server didn’t give any errors or complaints about mixing Enterprise with Developer editions in the same Availability Group.
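
For reference, here is a rough sketch of the AG-level T-SQL behind those steps, assuming an AG named [MyAG], a database [MyDB], and a secondary called ServerTwo (all hypothetical names; the un-install and re-install themselves happen outside of T-SQL):

--On the primary: switch the replica to manual failover before starting
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON N'ServerTwo' WITH (FAILOVER_MODE = MANUAL)

--On the primary: remove the secondary replica from the AG
ALTER AVAILABILITY GROUP [MyAG] REMOVE REPLICA ON N'ServerTwo'

--...un-install Enterprise and install Developer on ServerTwo...

--On the primary: add the replica back
ALTER AVAILABILITY GROUP [MyAG]
ADD REPLICA ON N'ServerTwo'
WITH (ENDPOINT_URL = N'TCP://ServerTwo.mydomain.com:5022',
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL)

--On ServerTwo: join the AG, then join the database
--(restore [MyDB] WITH NORECOVERY on ServerTwo first, or use automatic seeding)
ALTER AVAILABILITY GROUP [MyAG] JOIN
ALTER DATABASE [MyDB] SET HADR AVAILABILITY GROUP = [MyAG]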

If that sounds like a lot to you, keep in mind that I have everything automated (the un-install, the download\extraction of the installation ISO, everything). The downgrade of the 2-node cluster was completed in under 30 minutes for this small database. Most of that time was spent downloading media from a remote server and installing SQL Server. My complete process was this:

  1. Set the AG to manual failover – This is a best practice for any patching\upgrading\downgrading processes
  2. On the secondary replica
  3. Remove secondary replica from AG
  4. Delete un-joined database on the secondary – SQL will complain later if the database files exist
  5. Un-install SQL Server enterprise on the secondary
  6. Install SQL Server developer on the secondary
  7. Add the secondary replica back to the AG and replicate our database(s)
  8. Copy any logins\jobs\objects required from the primary to the secondary – We’re about to failover, so configure anything you need on your “new” secondary
  9. Failover
  10. On our “new” secondary replica
  11. Remove new secondary replica from the AG
  12. Delete un-joined database on the secondary – SQL will complain later if the database files exist
  13. Un-install SQL Server enterprise on the secondary
  14. Install SQL Server developer on the secondary
  15. Add the secondary replica back to the AG and replicate our database(s)
  16. Copy any logins\jobs\objects required from the primary to the secondary
  17. Set the AG back to automatic failover

Our configuration here uses an AlwaysOn Listener which allows me to failover the AG without taking down the application. In fact, the application didn’t make a peep during the entire process.

Setting an AlwaysOn Secondary Database Owner to SA

I’ve spent some time in a couple of AlwaysOn shops, and if you don’t manage it well, you can quickly accumulate secondary databases that are owned by DBA accounts. Naturally, secondaries are read-only, so you can’t simply change the owner to SA. One option is to fail over the AG and change the database owner, but that would require downtime. A better option would be to prevent this situation altogether.
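
A quick way to spot the offenders is to compare each database owner against sa’s well-known SID (0x01):

SELECT [name], SUSER_SNAME(owner_sid) AS owner_name
FROM sys.databases
WHERE owner_sid <> 0x01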

To start, SSMS allows you to script out the addition of a database to an AlwaysOn group. If you do so, you’ll find that under the hood the following actions are performed:

  1. Add the database to the AG on primary
  2. Perform a full backup on the primary
  3. Restore the full backup to the secondary
  4. Perform a log backup on the primary
  5. Restore the log backup to the secondary
  6. Begin data movement

We’re going to employ the EXECUTE AS statement, which I haven’t ever used outside of a certification exam… until now. Taking the script that SSMS generates, wrap the first restore statement (the full backup restore) in the EXECUTE AS statement. Something to the tune of:

:Connect ServerTwo
EXECUTE AS LOGIN='sa'
  RESTORE DATABASE [MyDB] FROM  DISK = N'\\ServerOne\MyDB.bak',  NORECOVERY,  NOUNLOAD,  STATS = 5
  GO
REVERT

This will work even if the SA login is disabled, but it will require you to be a member of the sysadmin role. Once the database has been restored to the secondary, its owner will be established as SA. I like this solution because it doesn’t require me to run the entire process while logged in as some other login. I can use my usual login and still keep my name off of any audits.

Who Is Using SSRS?

When I inherit an unruly SSRS environment (and they’re all unruly) I want to answer a few questions:

  1. What users are running reports?
  2. What reports are being executed?

The ReportServer database is very easy to understand but a majority of the queries I run hit the ExecutionLog table, or one of the ExecutionLog views. Microsoft has changed the schema of the ExecutionLog table over the years. They took the original ExecutionLog schema and created a view out of it, which tells us that there’s some integration point that couldn’t handle the schema change. They’ve also given us several views that perform some useful joins for us. Rather than re-invent the wheel, I almost always go straight to ExecutionLog2.

What users are running reports?

I use this query to quickly find the users that have executed a report in the last year. I change the time filter around but I like 1 year in environments where year-end reports are used.

SELECT DISTINCT [UserName]
FROM [dbo].[ExecutionLog2]
WHERE TimeStart >= GETDATE()-365

I use the results of this query to start communicating with the users and explaining the changes I may make to the SSRS environment.
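
For the second question, a similar query against the same view returns execution counts per report (adjust the time filter to taste):

SELECT [ReportPath],
COUNT(*) AS Executions,
MAX([TimeStart]) AS LastRun
FROM [dbo].[ExecutionLog2]
WHERE TimeStart >= GETDATE()-365
GROUP BY [ReportPath]
ORDER BY Executions DESC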