TRANSACTION_MUTEX wait in MS SQL Server

A couple of days ago, I was informed about performance degradation which periodically happened  in our production system of MS SQL Server 2012 instance running under Windows Server 2012R2. QUERYSTORE would be good solution for monitoring performance for the situation, however it comes  in SQL Server 2014 version.  Because there is not QUERYSTORE technology in SQL Server 2012, lets create manual sampling by using  BRENTOZAR  script sp_BlitzFirst. I have created JOB which runs every 15 minute and calls the sp_BlitzFirst. Please see below:

USE [msdb]
GO
DECLARE @jobId BINARY(16)
EXEC @ReturnCode = msdb.dbo.sp_add_job @job_name=N'DBAtools_Sampling',
@enabled=0,
@notify_level_eventlog=0,
@notify_level_email=0,
@notify_level_netsend=0,
@notify_level_page=0,
@delete_level=0,
@description=N'No description available.',
@category_name=N'[Uncategorized (Local)]',
@owner_login_name=N'SA', @job_id = @jobId OUTPUT
IF (@@ERROR <> 0 OR @ReturnCode<> 0) GOTO QuitWithRollback

EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Step1',
@step_id=1,
@cmdexec_success_code=0,
@on_success_action=1,
@on_success_step_id=0,
@on_fail_action=2,
@on_fail_step_id=0,
@retry_attempts=0,
@retry_interval=0,
@os_run_priority=0, @subsystem=N'TSQL',
@command=N'EXEC sp_BlitzFirst
@Seconds = 30,
@OutputDatabaseName = ''DBAtools'',
@OutputSchemaName = ''dbo'',
@OutputTableName = ''BlitzFirst'',
@OutputTableNameFileStats = ''BlitzFirst_FileStats'',
@OutputTableNamePerfmonStats = ''BlitzFirst_PerfmonStats'',
@OutputTableNameWaitStats = ''BlitzFirst_WaitStats'',
@OutputTableNameBlitzCache = ''BlitzCache''',
@database_name=N'DBAtools',
@flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name=N'Schedule1',
@enabled=1,
@freq_type=4,
@freq_interval=1,
@freq_subday_type=4,
@freq_subday_interval=15,
@freq_relative_interval=0,
@freq_recurrence_factor=0,
@active_start_date=20190620,
@active_end_date=99991231,
@active_start_time=0,
@active_end_time=235959,
@schedule_uid=N'1d27b7fb-ae33-4334-94f2-7b9101692760'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
COMMIT TRANSACTION
GOTO EndSave
QuitWithRollback:
IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION
EndSave:
GO

About  sp_BlitzFirst  you can read from BRENTOZR  blog. It creates a couple of tables and views in the given database for periodically sampling (30 seconds in our situation) from SQL Server internal performance views like  dm_os_wait_stats, dm_io_virtual_file_stats and etc.

After sampling we can select the blitzfirst_waitstats  table and see which are the most occurred waits.

select [wait_type],
 sum([wait_time_ms]) / 1000 / 3600 wt
  from [dbatools].[dbo].[blitzfirst_waitstats] group by [wait_type] 
    order by wt desc

wait_type	         wt
TRANSACTION_MUTEX	30024
BACKUPIO	        2504
TRACEWRITE	        1792
ASYNC_NETWORK_IO	1179
SOS_SCHEDULER_YIELD	906
CXPACKET	        753
LCK_M_IS	        696
LCK_M_SCH_M	        630
WRITELOG	        513
LCK_M_SCH_S	        452
LCK_M_IX	        319
...........

As we see the TRANSACTION_MUTEX is the top wait. The Books Online definition of TRANSACTION_MUTEX is that it “occurs during synchronization of access to a transaction by multiple batches. Lets go deeper and catch the query which generates TRANSACTION_MUTEX wait. For doing that I am going to use Extended Events technology.  

First of all I got the code of TRANSACTION_MUTEX wait 

SELECT
[map_key]
FROM sys.dm_xe_map_values
WHERE [name] = N'wait_types'
AND [map_value] = N'TRANSACTION_MUTEX';

map_key
---------
210

Then created Extended Event for catching Query which generate TRANSACTION_MUTEX wait.

CREATE EVENT SESSION [InvestigateWaits] ON SERVER 
ADD EVENT sqlos.wait_info(
    ACTION(sqlserver.database_id,sqlserver.plan_handle,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack,sqlserver.username)
    WHERE ([wait_type]=(210) AND [opcode]=(1))) 
ADD TARGET package0.ring_buffer
WITH (MAX_MEMORY=51200 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=5 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF)
GO

Activate created Extended Event and open Watch Live Data page.


We should see window like below.

After monitoring some period of time, At last I have got that query. The query and its plan shown below.

SELECT 
    [GroupBy1].[A1] AS [C1]
    FROM ( SELECT 
        COUNT(1) AS [A1]
        FROM  [dbo].[VW_DOCS] AS [Extent1]
        LEFT OUTER JOIN [dbo].[DOCS_READSTATUS] AS [Extent2] ON ([Extent1].[DocId] = [Extent2].[DocId]) AND ([Extent2].[WorkplaceId] = 24)
        WHERE ([Extent1].[DocDoctypeId] = 1) AND ([Extent1].[DocPeriodId] = 3) AND (([Extent1].[DocOrganizationId] = 24) OR ( EXISTS (SELECT 
            1 AS [C1]
            FROM  [dbo].[DOCS_EXECUTOR] AS [Extent3]
            INNER JOIN [dbo].[DOCS_DIRECTIONS] AS [Extent4] ON [Extent3].[ExecutorDirectionId] = [Extent4].[DirectionId]
            WHERE ([Extent1].[DocId] = [Extent3].[ExecutorDocId]) AND ((([Extent3].[ExecutorWorkplaceId] = 24) AND (1 = [Extent4].[DirectionConfirmed])) OR (([Extent3].[ExecutorOrganizationId] = 24) AND (1 = [Extent4].[DirectionConfirmed])))
        ))) AND (( NOT ((2 =  CAST( [Extent1].[DocDeleted] AS int)) AND ([Extent1].[DocDeleted] IS NOT NULL))) OR ((2 =  CAST( [Extent1].[DocDeleted] AS int)) AND ([Extent1].[DocInsertedById] = 24))) AND (1 = 1)
    )  AS [GroupBy1] 

It is one of the most usable query and it takes about 3 seconds to complete. From the query plan we see expensive Nested Loop join which will generate concurrency in case of many query like this will be running. However, If we change join method to Hash Join by using SQL Server hint OPTION(HASH JOIN), we will decrease concurrency and query run time (less then one second). Corrected version of query and its plan shown below.

 
SELECT 
    [GroupBy1].[A1] AS [C1]
    FROM ( SELECT 
        COUNT(1) AS [A1]
        FROM  [dbo].[VW_DOCS] AS [Extent1]
        LEFT OUTER JOIN [dbo].[DOCS_READSTATUS] AS [Extent2] ON ([Extent1].[DocId] = [Extent2].[DocId]) AND ([Extent2].[WorkplaceId] = 24)
        WHERE ([Extent1].[DocDoctypeId] = 1) AND ([Extent1].[DocPeriodId] = 3) AND (([Extent1].[DocOrganizationId] = 24) OR ( EXISTS (SELECT 
            1 AS [C1]
            FROM  [dbo].[DOCS_EXECUTOR] AS [Extent3]
            INNER JOIN [dbo].[DOCS_DIRECTIONS] AS [Extent4] ON [Extent3].[ExecutorDirectionId] = [Extent4].[DirectionId]
            WHERE ([Extent1].[DocId] = [Extent3].[ExecutorDocId]) AND ((([Extent3].[ExecutorWorkplaceId] = 24) AND (1 = [Extent4].[DirectionConfirmed])) OR (([Extent3].[ExecutorOrganizationId] = 24) AND (1 = [Extent4].[DirectionConfirmed])))
        ))) AND (( NOT ((2 =  CAST( [Extent1].[DocDeleted] AS int)) AND ([Extent1].[DocDeleted] IS NOT NULL))) OR ((2 =  CAST( [Extent1].[DocDeleted] AS int)) AND ([Extent1].[DocInsertedById] = 24))) AND (1 = 1)
    )  AS [GroupBy1] 
	OPTION (HASH JOIN)


Problem has been fixed. Thank you for attention 🙂


 

A Real life performance issue due to Fetch Next N rows in MS SQL Server

A couple of days ago, I was asked to improve the performance of a particular SQL query which took, in average, 10 second to complete. Here’s below the corresponding SQL and It’s  plan taken from a MS SQL Server 2017 instance running under Windows Server 2016 Standart.

SELECT [x].*
FROM [NotA] AS [x]
WHERE [x].[IsDeleted] = 0
ORDER BY [x].[Id]
OFFSET 5144530 ROWS FETCH NEXT 10 ROWS ONLY


NotA table has primary key Clustered Index on ID column. From the SQL Plan statistics we can observe that more then 5 million Index rows were scanned. But why almost all Index rows were scanned, despite that we read just 10 rows from the Table. Each Index leaf node has address for the next leaf node, the next one for the next leaf node and etc. The more value of OFFSET parameter  the more Index rows were scanned. Let’s see the SQL Plan if we set smaller value for OFFSET parameter.

SELECT [x].*
FROM [NotA] AS [x]
WHERE [x].[IsDeleted] = 0
ORDER BY [x].[Id]
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY

Just 20 rows were scanned and SQL takes less then 1 second to complete. But what should we do If we need to set big value for OFFSET parameter.

Solution for the situation will be using the exact value by BETWEEN statement in the query.  The SQL query and It’s Plan shown below.

SELECT [x].*
FROM [NotA] AS [x]
WHERE [x].[IsDeleted] = 0 and
[x].[Id] between 5144530 and 5144530 + 10
ORDER BY [x].[Id]
--OFFSET 5144530 ROWS FETCH NEXT 10 ROWS ONLY 

SQL query takes less then 1 second to complete.

In case of data is not sequential using approach like 5144530 and 5144530 + 10 is not correct of course. In that case you should use approach like below:

 
SELECT [x].* FROM [NotA] AS [x] WHERE [x].[IsDeleted] = 0 and [x].[Id] in 
(SELECT [x].id FROM [NotA] AS [x] WHERE [x].[IsDeleted] = 0 ORDER BY [x].[Id] 
OFFSET 5144530 ROWS FETCH NEXT 10 ROWS ONLY)

That’s all thanks for attention 🙂