Caching replica state to eliminate HADR_CLUSAPI_CALL waits on dm_hadr_availability_replica_states

We use tsqlscheduler to manage most of our SQL Jobs (a few hundred jobs in a several overlapping AGs), and when lots of schedules overlap (e.g. many concurrent jobs kick off on the hour) we saw waits and blocking on the function that was attempting to determine whether or not the server was the primary replica:

Blocking

This function queries the DMV sys.dm_hadr_availability_replica_states, and we found that this DMV doesn't perform so well when the cluster size, number of AGs, and number of concurrent queries against the DMV rises.

Querying for the replica's role

select      @role = ars.role_desc
from        sys.dm_hadr_availability_replica_states ars
inner join  sys.availability_groups ag
on          ars.group_id = ag.group_id
where       ag.name = @availabilityGroupName
and         ars.is_local = 1;

Under the hood this DMV is calling into the Windows clustering APIs to determine the current status of the replica. There isn't a tonne of information out there - some comments on this SO answer confirm what the DMV is doing, and this MSFT post adds a little more detail. After checking some of the obvious items listed in the second post I decided that maybe we shouldn't be querying the DMV quite so often, and given how rarely the information changes ("is my node the primary replica?") caching this information seemed like a good fit.

Implementing caching

Caching of the replica state is available in any release of tsqlscheduler starting with 1.1. Each task now specifies whether or not it uses the cached replica check or not - most tasks should use the cached replica check, with the exception of the task which keeps the cache updated.

The caching is implemented by a task that persists all replica states for the given AG, and is queried by the GetCachedAvailabilityGroupRole function.

Update replica status

create or alter procedure scheduler.UpdateReplicaStatus
as
begin
    set nocount on;
    set xact_abort on;

    declare  @availabilityGroup nvarchar(128);

    select 	@availabilityGroup = ag.AvailabilityGroup
    from 	scheduler.GetAvailabilityGroup() as ag;

    merge scheduler.ReplicaStatus as rs
    using (
        select      ar.replica_server_name, ars.role_desc
        from        sys.dm_hadr_availability_replica_states ars
        inner join  sys.availability_groups ag
        on          ars.group_id = ag.group_id
        join        sys.availability_replicas as ar
        on          ar.replica_id = ars.replica_id
        where       ag.name = @availabilityGroup
    ) as src
    on src.replica_server_name = rs.HostName
    and	src.role_desc = @availabilityGroup
    when matched then
        update
            set rs.AvailabilityGroupRole = src.role_desc
    when not matched then
        insert ( AvailabilityGroup, HostName, AvailabilityGroupRole )
        values ( @availabilityGroup, src.replica_server_name, src.role_desc )
    when not matched by source
        then delete;
end

Get cached role

create or alter function scheduler.GetCachedAvailabilityGroupRole
(
	@availabilityGroupName nvarchar(128)
)
returns nvarchar(60)
as
begin
	declare @role nvarchar(60);

	select 	@role = rs.AvailabilityGroupRole
	from 	scheduler.ReplicaStatus as rs
	where	rs.AvailabilityGroup = @availabilityGroupName
	and	rs.HostName = host_name();

	return coalesce(@role, N'');
end

Deploying these changes eliminated the blocking behaviour, and we've seen the back of some fairly spectacular blocking chains we were able to produce when a server was under heavy load (yes, that is almost 3 minutes waiting to decide if the task can even run or not!).

Much More Blocking