SQL Server Extended Events in Production

I have finally jumped on the Extended Events bandwagon. It was tough to pry Trace and Profiler from my cold, legacy hands! But I'm mostly there. I'm still working on a good way to hook up ClearTrace to these things. It might be as simple as requiring you to pull down the XE files. I'd like to make it simpler though.

I've read lots of article about Extended Events that describe how to create and query them but very few about which ones are helpful. Here are the events I've been capturing on our production servers that have been helpful.

Fields

We try to capture a consistent set of fields in addition to the event specific fields. Here's what we capture:

client_app_name
client_hostname
client_pid. This is the Process ID on the client that is connecting. If needed you can consult logs on the client to learn more about what this was.
database_name. Sometimes I capture database_id also. Especially in older versions where database_name isn't available.
event_sequence
server_instance_name. Yes, this is redundant since you're already connecting to the server and you can query the server name. I suggest you capture it anyway. It's very helpful when you start to centralize these.
server_principal_name
session_id
sql_text

There are a few fields I've thought about capturing or have captured but found less valuable.

Query and Plan hashes. I had high hopes for these but haven't ever used them.
Session Server Principal Name. We don't do any impersonation that I know of so this hasn't been useful so far. I'm still mulling over collecting this instead of just the server principal name. Ideally I would collect them both and flag any differences. But I'm not there yet.

Events

I typically run two Extended Event sessions: one for logins and one for everything else. We keep logins going back for as long as can on the server. Right now I'm running ten 50 MB files for these on most servers. A few high traffic servers have twenty 50 MB files. It's also important to filter for is_cached is false.

Many of these events overlap with the system_health and AlwaysOn_health sessions. I don't mind the duplicates. They don't occur all that often.

Error Reported

I was initially filtering for just severity 14 (security related) and severity >= 16 (general errors). I'm starting to experiment with capturing 11 and higher but I don't know enough to write anything meaningful yet. I also filter out 17830. For some reason, we get LOTS of them and I've never tracked down why. YMMV.

This event has proven very, very helpful. I capture our applications doing all kinds of interesting things. When we do new deployments, I watch this closely. I often notice problems before the application teams because I can see the application generating errors. I see permission errors, primary key violations, syntax errors, and all kinds of crazy app stuff. We're just starting a project to give our developers and others in the organization easy access to these events.

You will also see pesky things like user generated errors. You can filter those out in your reporting. But I'm not a fan of using RAISERROR with a severity higher than 10 unless it's an actual exception that you plan to fix.

Blocked Process Report

This requires you to set the blocked process threshold on your server. I typically set it at 15 seconds or 60 seconds for servers that run big long ugly queries and 5 or 10 seconds for more transactional system. The trick is to set a value that generates enough but not too many. And don't set it below 5. But it may take some experimentation.

I use this to answer the question, "What was blocking our transactions at 5:30 AM this morning?"

Deadlocks

There are two deadlock events I capture: lock_deadlock_chain and xml_deadlock_report. Between the two of them I get enough information to figure out what happened.

High Resource SQL

This is four events, rpc_completed, sp_statement_completed, sql_batch_completed, sql_statement_completed and I capture all four. The batch level and statement level captures give me slightly different views into what was running. My default filter for these is CPU > 1 second or logical reads > 500,000. You'll need to tweak these based on your environment and how much data you want to review. But these have proven to be a good starting point.

Error Log Written

I typically don't refer to this one as often but it's nice to have in place with the others.

Object Created, Altered, or Deleted

I'm just starting to experiment with these. We currently capture all these through an Event Notification on DDL_EVENT. And that works really well so I haven't had the need for this yet. I do know that you should filter out events on database_id = 2 or you'll get all the tempdb objects. And I'm seeing lots of CEIP extended event work which may be related to the fact that these servers can't see the Internet.

Summary

So those are the events and fields I capture through Extended Events on production servers. I've found this data to be very helpful troubleshooting production problems.