Finding vMotions with PowerShell and the vCenter CLI

If you work in a VMware shop you’re probably aware that VMs can be migrated to a new host under certain conditions. While the purpose of these migrations is good, I’ve found that virtual SQL Servers usually do not like to be vMotioned. At my current employer, this is often noticed in AlwaysOn Availability Groups.

Our team will occasionally get notified that an AlwaysOn group has failed over. After investigating, we can’t identify any meaningful cause. Then, I’ll look in the vSphere client and sure enough, the server was migrated to a new host. This happens so often that vSphere is now the first place I check for failovers. VMware makes some PowerShell capabilities available to us, so I’ve created a cmdlet to quickly return the VMware “events” for a given VM.

Before I get to the code, you might be thinking that vMotion is not the culprit. I assure you it is. I routinely have this argument with virtualization experts and even other DBAs. “But Microsoft says it’s supported!”… I know they do, but VMs still don’t migrate without impact 100% of the time. As a DBA, I believe vMotions should be used only for host outages (planned or unplanned). Letting vCenter migrate VMs based on resource usage means you haven’t allocated the proper resources to your servers.

First, download and install VMware PowerCLI. You can find the installer and the documentation here: https://www.vmware.com/support/developer/PowerCLI/index.html

As the name implies, the Command Line Interface is going to drop you into an interactive session with vCenter. We’re going to design our function to perform the following:

  1. Save our current location
  2. Import the CLI module
  3. Open the CLI
  4. Use the CLI to get the events for our server
  5. Close the CLI
  6. Return to our original location

The first step is easily accomplished with a native PowerShell cmdlet, and the second step is a basic module import

Push-Location

Import-Module VMware.VimAutomation.Core

Push-Location captures your current directory and adds it to a stack so we can return to it later.

Next, open the CLI with the following, replacing 'VCenterServerXYZ' with the name of your vCenter server

$VCenterServer = 'VCenterServerXYZ'
Connect-VIServer $VCenterServer

Once the CLI is opened, we need to retrieve the VM we’re after, then retrieve the events for that VM

$Server = 'ServerABC'
Get-VM -Name $Server | Get-VIEvent

The above command outputs the raw results to the console. I prefer to clean up the output as follows

$Server = 'ServerABC'
Get-VM -Name $Server | Get-VIEvent | Select-Object CreatedTime, FullFormattedMessage
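
Get-VIEvent returns every event type for the VM, not just migrations. If you only care about vMotions, you can filter on the event’s type name. This is a sketch: the event class names (VmMigratedEvent, DrsVmMigratedEvent) are an assumption and can vary by vSphere version.

# Keep only migration-related events; the type-name pattern is an assumption
Get-VM -Name $Server | Get-VIEvent |
    Where-Object { $_.GetType().Name -match 'MigratedEvent' } |
    Select-Object CreatedTime, FullFormattedMessage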

Once we have our information, we can close the CLI

Disconnect-VIServer -Server $VCenterServer

And lastly, pop our original directory location off the stack

Pop-Location

In practice, I have this code wrapped up in a cmdlet and loaded into every PowerShell session via my profile. I have the vCenter server hardcoded, but the VM that I’m retrieving events for is a cmdlet parameter. The whole cmdlet looks like this

<#
.NOTES
 NAME:    Get-VMEvents
 AUTHOR:  James Livingston

.SYNOPSIS
Queries the vCenter server for the events related to a specific virtual machine.
Requires VMware PowerCLI.

.EXAMPLE
Get-VMEvents -Server ServerABC
#>
function Get-VMEvents {
    [cmdletbinding()]
    param (
        [Parameter(Mandatory = $true)] [string]$Server
    )

    Push-Location

    Import-Module VMware.VimAutomation.Core

    $VCenterServer = 'VCenterServerXYZ'

    Connect-VIServer $VCenterServer | Out-Null

    Get-VM -Name $Server | Get-VIEvent | Select-Object CreatedTime, FullFormattedMessage

    Disconnect-VIServer -Server $VCenterServer -Force -Confirm:$false

    Pop-Location
}

Get-VMEvents -Server 'ServerABC'

So You Blew Up the Plan Cache…

Recently, I was asked to investigate high CPU on a production server. I began my routine, which is a combination of checking wait stats, performance counters, and resource-hogging statements. I identified a CPU-intensive stored procedure with a cardinality issue in its plan. I was about to present my findings when… POOF.

Another DBA had just executed DBCC FREEPROCCACHE and cleared the procedure cache. Immediately following this statement, the server hit 100% CPU and stayed there for 15-20 minutes. So, I started my investigation over and found the server was being flooded by a rarely seen wait type:

RESOURCE_SEMAPHORE_QUERY_COMPILE

This wait type means SQL Server is trying to create an execution plan but doesn’t have enough memory for the compile, so the request waits until more memory is available.
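
If you want to confirm the waits for yourself, a quick check of sys.dm_os_wait_stats (a sketch; the counters are cumulative since the last restart or stats clear) looks like this:

-- Cumulative compile-memory waits since the wait stats were last cleared
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'RESOURCE_SEMAPHORE_QUERY_COMPILE'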

On a stable system, you probably won’t see this wait type much unless you’re starved for memory. The server I was investigating had 64GB, which seemed appropriate for the workload. So why are we getting hammered with these waits? And why is it taking so long to stabilize?

The key here was to look at the plan cache after the system stabilized. I use the SQLSkills script found on Kimberly Tripp’s blog post.
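
That script is worth reading in full, but the basic idea, stripped down to a sketch against sys.dm_exec_cached_plans (this is not the SQLSkills script itself), looks something like this:

-- Break the plan cache down by object type: plan counts, size, re-use, and single-use plans
SELECT objtype,
       COUNT(*) AS total_plans,
       SUM(CAST(size_in_bytes AS BIGINT)) / 1024 / 1024 AS total_mb,
       AVG(usecounts) AS avg_uses_per_plan,
       SUM(CASE WHEN usecounts = 1 THEN 1 ELSE 0 END) AS single_use_plans
FROM sys.dm_exec_cached_plans
GROUP BY objtype
ORDER BY total_plans DESC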

[Figure: plan cache breakdown by object type, dominated by adhoc plans]

Notice the high count of adhoc plans; they completely dominate the plan cache. A third of the plans are single-use, and the average is 58 uses per plan. There are far fewer procedure plans, but they see extremely high re-use, averaging around 100,000 uses per plan.

After the plan cache was cleared, all of these plans had to compete for the same compilation memory, hence the memory waits. Having to re-compile tens of thousands of plans contributed to the high CPU, which in turn led to dreadful database performance. In this scenario, single-use adhoc plans were using resources that would have been better spent on high-use procedure plans. Had this been a system with fewer plans (or fewer adhoc plans), the performance degradation might have been shorter-lived.

The lesson learned here is that blowing up the procedure cache on a system with a large number of plans will lead to pain. In our case, we could not differentiate between the initial high-CPU issue and the high CPU caused by clearing the cache, so in the end we don’t know whether DBCC FREEPROCCACHE helped the issue at all.

Rolling Downgrade From Enterprise to Developer

So, is it possible to go from Enterprise to Developer in an AlwaysOn cluster? TLDR: Yes.

I was working in a dev environment recently that had been set up with AlwaysOn Availability Groups using the Enterprise edition of SQL Server 2016. This environment is completely devoted to development of our applications, which means it qualifies for use of the Developer edition of SQL Server. Not all shops agree with using Developer edition for dev or test environments, and I understand the argument. However, our shop decided that in this case, we’re comfortable with it.

Now, there’s an “Edition Upgrade” option under the “Maintenance” tab of the SQL Server 2016 installer. The intended use is to upgrade from Developer to Enterprise… but there’s no functionality to downgrade from Enterprise to Developer. I could have had new virtual servers built and conducted a migration, but I decided to attempt a “rolling downgrade” across this AlwaysOn cluster.

The basic process is to remove the replica from the Availability Group, uninstall SQL Server Enterprise edition, install SQL Server Developer edition, and finally add the replica back to the Availability Group. During this rolling downgrade, SQL Server didn’t give any errors or complaints about mixing Enterprise and Developer editions in the same Availability Group.

If that sounds like a lot to you, keep in mind that I have everything automated (the uninstall, the download and extraction of the installation ISO, everything…). The downgrade of the two-node cluster was completed in under 30 minutes for this small database, and most of that time was spent downloading media from a remote server and installing SQL Server. My complete process was this (a rough T-SQL sketch of the Availability Group commands follows the list):

  1. Set the AG to manual failover – this is a best practice for any patching/upgrading/downgrading process
  2. On the secondary replica:
     - Remove the secondary replica from the AG
     - Delete the un-joined database on the secondary – SQL will complain later if the database files exist
     - Uninstall SQL Server Enterprise edition
     - Install SQL Server Developer edition
     - Add the secondary replica back to the AG and replicate our database(s)
     - Copy any logins/jobs/objects required from the primary to the secondary – we’re about to fail over, so configure anything you need on your “new” secondary
  3. Fail over
  4. On our “new” secondary replica (the old primary):
     - Remove the new secondary replica from the AG
     - Delete the un-joined database on the secondary – SQL will complain later if the database files exist
     - Uninstall SQL Server Enterprise edition
     - Install SQL Server Developer edition
     - Add the secondary replica back to the AG and replicate our database(s)
     - Copy any logins/jobs/objects required from the primary to the secondary
  5. Set the AG back to automatic failover
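
Most of the work above is installer and PowerShell automation, but the Availability Group steps themselves are plain T-SQL. Here’s a rough sketch of those commands; the AG name [MyAG], the replica name ServerTwo, the database name [MyDB], and the endpoint URL are all example values, so adjust them to your environment:

-- On the primary: set the secondary to manual failover before touching it
ALTER AVAILABILITY GROUP [MyAG] MODIFY REPLICA ON 'ServerTwo' WITH (FAILOVER_MODE = MANUAL)

-- On the primary: remove the secondary replica from the AG
ALTER AVAILABILITY GROUP [MyAG] REMOVE REPLICA ON 'ServerTwo'

-- On the primary, after reinstalling the node as Developer edition: add the replica back
ALTER AVAILABILITY GROUP [MyAG]
    ADD REPLICA ON 'ServerTwo' WITH (
        ENDPOINT_URL = 'TCP://ServerTwo.mydomain.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL)

-- On the secondary: join the AG, then join the restored database to it
ALTER AVAILABILITY GROUP [MyAG] JOIN
ALTER DATABASE [MyDB] SET HADR AVAILABILITY GROUP = [MyAG]

-- On the secondary that is about to become the new primary: fail over
ALTER AVAILABILITY GROUP [MyAG] FAILOVER

-- Once both nodes are downgraded: set the AG back to automatic failover (on the primary)
ALTER AVAILABILITY GROUP [MyAG] MODIFY REPLICA ON 'ServerTwo' WITH (FAILOVER_MODE = AUTOMATIC)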

Our configuration here uses an AlwaysOn listener, which allows me to fail over the AG without taking down the application. In fact, the application didn’t make a peep during the entire process.

Pester Your SQL Servers!

In March I did my first presentation at SQL Saturday. I’d say it went very well, and I was encouraged by the attendee feedback. My presentation was on infrastructure testing with the Pester PowerShell module. I didn’t go into depth on the advanced features of Pester (e.g., mocking) but instead focused on how you can write Pester tests to validate SQL Server. You can find a repo of my demo and slides here. Fair warning: the presentation is heavy on demos and talking, with brief slides.

As far as infrastructure testing goes, here’s a simple example that tests if the SQL Server service is running:

Describe 'SQL Server Services' {     
    It 'Should be running' {
        (Get-Service 'MSSQL$SQL1').Status | Should -Be 'Running'
    }
}

And here’s an example of querying SQL Server internals (it tests whether max server memory has been changed from the default value):

$SQLInstance = 'localhost\sql1'

$Query       = `
"SELECT [name], [value] 
FROM sys.configurations
WHERE [name] IN 
('max server memory (MB)')"

$ConfigQueryRes = Invoke-SQLCMD -ServerInstance $SQLInstance -Database 'master' -Query $Query

Describe "Instance Configuration on $SQLInstance"  {

    It "Maximum Server Memory should be set" {
        $maxmem = ($ConfigQueryRes | Where {$_.name -eq 'max server memory (MB)'}).value
        $maxmem | Should -Not -Be 2147483647
    }
}

Pester gives you the ability to break your tests up into many test scripts, avoiding a single lengthy PowerShell script. In my presentation I talked about how you could build a library of scripts: one for instance configuration, one for logins, one for jobs, and so on. The advantage of Pester is that you can write the tests, run them, and collect the results in a consistent way.

At my present employer, I’ve written well over 1,000 lines of Pester tests, spread across a dozen or so test files. However, running all of those tests is a single call to Invoke-Pester, a one-liner.
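
For example, assuming the test files live in a folder like C:\SQLServerTests (the path is just an illustration), this one-liner discovers every *.Tests.ps1 file, runs them all, and reports the results together:

# Run every *.Tests.ps1 file under the folder; the positional path works in Pester 4 and 5
Invoke-Pester 'C:\SQLServerTests'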

Setting an AlwaysOn Secondary Database Owner to SA

I’ve spent some time in a couple of AlwaysOn shops, and if you don’t manage it well, you can quickly accumulate secondary databases that are owned by DBA accounts. Naturally, secondaries are read-only, so you can’t simply change the owner to SA. One option is to fail over the AG and change the database owner, but that would require downtime. A better option is to prevent the situation altogether.

To start, SSMS allows you to script out the addition of a database to an AlwaysOn group. If you do so, you’ll find that under the hood the following actions are performed:

  1. Add the database to the AG on primary
  2. Perform a full backup on the primary
  3. Restore the full backup to the secondary
  4. Perform a log backup on the primary
  5. Restore the log backup to the secondary
  6. Begin data movement

We’re going to employ the EXECUTE AS statement, which I haven’t ever used outside of a certification exam… until now. Take the script that SSMS generates and wrap the first restore statement (the full backup restore) in EXECUTE AS. Something to the tune of:

:Connect ServerTwo
EXECUTE AS LOGIN='sa'
  RESTORE DATABASE [MyDB] FROM DISK = N'\\ServerOne\MyDB.bak' WITH NORECOVERY, NOUNLOAD, STATS = 5
REVERT
GO

This will work even if the SA login is disabled, but it does require you to be a member of the sysadmin role. Once the database has been restored to the secondary, its owner will be SA. I like this solution because it doesn’t require me to run the entire process while logged in as some other login. I can use my usual login and still keep my name off any audits.
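
To confirm the result on the secondary, a quick check of sys.databases (a sketch, using the example database name from above) should show SA as the owner:

-- The restored secondary database should now be owned by sa
SELECT [name], SUSER_SNAME(owner_sid) AS owner_login
FROM sys.databases
WHERE [name] = 'MyDB'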

Who Is Using SSRS?

When I inherit an unruly SSRS environment (and they’re all unruly), I want to answer a few questions:

  1. What users are running reports?
  2. What reports are being executed?

The ReportServer database is very easy to understand, but the majority of the queries I run hit the ExecutionLog table or one of the ExecutionLog views. Microsoft has changed the schema of the ExecutionLog table over the years. They took the original ExecutionLog schema and created a view out of it, which tells us that there’s some integration point that couldn’t handle the schema change. They’ve also given us several views that perform some useful joins for us. Rather than reinvent the wheel, I almost always go straight to ExecutionLog2.

What users are running reports?

I use this query to quickly find the users that have executed a report in the last year. I change the time filter around, but I like one year in environments where year-end reports are used.

SELECT DISTINCT [UserName]
FROM [dbo].[ExecutionLog2]
WHERE TimeStart >= GETDATE()-365

I use the results of this query to start communicating with those users and explaining the changes I may make to the SSRS environment.
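
What reports are being executed?

For the second question, a similar query against ExecutionLog2 shows which reports are actually being run and how often. This is a sketch; ReportPath is the column name in the ExecutionLog2 view (later views rename it to ItemPath), and the one-year filter is the same assumption as above:

SELECT [ReportPath], COUNT(*) AS Executions
FROM [dbo].[ExecutionLog2]
WHERE TimeStart >= GETDATE()-365
GROUP BY [ReportPath]
ORDER BY Executions DESC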

Mirroring Endpoint Ownership

Creating a mirroring endpoint in SQL Server can sometimes do goofy things with permissions.

I’m facing an issue right now where two bad things are happening when one of our DBAs creates an endpoint:

  1. The creator’s account becomes the owner of the endpoint, which is not uncommon in SQL Server. However, like other ownerships, this will prevent you from dropping the owning login (a quick query to spot these owners follows this list).
  2. The creator’s account is added as an individual login to the instance. Even if that account already has access through a domain group, it still gets added as an individual (bug?). We’ve got to jump through hoops to remove the login, and it can’t be removed until the endpoint’s ownership is cleared up.
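
To see which login currently owns each endpoint, a quick query against the catalog views (a sketch) does the trick:

-- Endpoints owned by individual accounts are the ones that will block dropping that login
SELECT e.name AS endpoint_name, sp.name AS owner_login
FROM sys.endpoints AS e
JOIN sys.server_principals AS sp
    ON sp.principal_id = e.principal_id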

We can fix it with a one-liner:

ALTER AUTHORIZATION
ON ENDPOINT::[MirrorMirrorOnTheWall]
TO [DOMAIN\ACCOUNT]

Set the owner to an account that’s going to be around for a while (NOT A PERSON). If the account you chose doesn’t have sysadmin, you may also need to grant CONNECT on the endpoint:

GRANT CONNECT
ON ENDPOINT::[MirrorMirrorOnTheWall]
TO [DOMAIN\ACCOUNT]

With problem number one solved, we’re no longer blocked from dropping the creator’s login (provided the login doesn’t own anything else). We can drop the login, but there’s a way to avoid this whole mess altogether.

Using EXECUTE AS LOGIN, we can create the endpoint as the login that should own it:

EXECUTE AS LOGIN = 'DOMAIN\ACCOUNT'
    CREATE ENDPOINT MirrorMirrorOnTheWallFixed
    STATE=STARTED 
    AS TCP (LISTENER_PORT=5022) 
    FOR DATABASE_MIRRORING (ENCRYPTION=SUPPORTED ALGORITHM AES, ROLE=ALL)
REVERT

Creating the owning login first and then running the above script keeps things tidy: the endpoint gets created with the owner you determined, and no individual logins get added to the instance.