Another common and hard-to-diagnose issue: when users log off from their session on a TS/XenApp server, three processes stay stuck there – Csrss, Winlogon and LogonUI. Although the users have logged off, these processes keep the session alive, eating resources and at some stage causing unexpected behaviour. It is a bit difficult to show the full stack here, but I am documenting the technique that can be used to find the root cause of this type of issue.
Tool – WinDbg
Complete Memory Dump – for hang-related issues it is a good idea to take a complete memory dump, and at least 2-3 of them to check for consistency.
Step 1 – Ensure symbols are loaded. It may be a good idea to run the lm command to see which modules are loaded, and then run .reload /f to force the symbol download.
Step 2 – List all the processes. It helps to have them sorted by session, so run the command !sprocess -4. This shows all the sessions in order, along with the processes present in each session.
Step 3 – Now some manual work: look into each session and check for sessions that contain just these three processes, perhaps copy-pasting them into Notepad. I found at least 6-7 such sessions.
SessionId: 1 Cid: 038c Peb: 7fffffd9000 ParentCid: 037c
DirBase: 1ad871000 ObjectTable: fffff8a001c729e0 HandleCount: 79.
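The manual scan in Step 3 can be partly automated if the !sprocess -4 output is saved to a text file first. The Python sketch below is only an illustration and not part of the original procedure: it assumes each process entry has a "SessionId:" field followed later by an "Image:" line (the usual process summary layout), and the function name find_stuck_sessions is hypothetical.

```python
import re
from collections import defaultdict

# The three processes the article reports as left behind after logoff.
STUCK_SET = {"csrss.exe", "winlogon.exe", "logonui.exe"}

def find_stuck_sessions(text):
    """Return sorted session IDs whose only processes are the three above.

    Assumes every process entry has a 'SessionId: N' field and a later
    'Image: name' line; adjust the patterns if your output differs.
    """
    images = defaultdict(set)
    session = None
    for line in text.splitlines():
        m = re.search(r"SessionId:\s*(\S+)", line)
        if m:
            session = m.group(1)
            continue
        m = re.search(r"^\s*Image:\s*(\S+)", line)
        if m and session is not None:
            images[session].add(m.group(1).lower())
            session = None  # wait for the next process entry
    return sorted(s for s, names in images.items() if names == STUCK_SET)
```

Calling find_stuck_sessions on the contents of the saved file returns the session IDs worth inspecting in the next steps. It only narrows the list, so the candidates still need to be checked by hand.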
Step 4 – Now check all threads of each Csrss and Winlogon process in each hung session – !process <process-ID> ff (for example, !process fffffa800a68c2e0 ff)
Step 5 – This shows information about the process, including all of its threads…
Step 6 – Now look into each thread and check for an ALPC wait chain message, something like the one below:
THREAD fffffa800a696700 Cid 038c.0398 Teb: 000007fffffdc000 Win32Thread: fffff900c01bf360 WAIT: (WrLpcReply) UserMode Non-Alertable
fffffa800a696ac0 Semaphore Limit 0x1
Waiting for reply to ALPC Message fffff8a003053d00 : queued at port fffffa800aa539e0 : owned by process fffffa8009d86630
Owning Process fffffa800a68c2e0 Image: csrss.exe
Attached Process N/A Image: N/A
Wait Start TickCount 3409868 Ticks: 2007111 (0:08:42:41.109)
Step 7 – The important line in the above output is - Waiting for reply to ALPC Message fffff8a003053d00 : queued at port fffffa800aa539e0 : owned by process fffffa8009d86630
Step 8 – To inspect that ALPC message, run the command – !alpc /m fffff8a003053d00
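When many threads have to be checked, the wait line can be pulled out of saved debugger output with a regular expression. This is a minimal Python sketch, assuming the thread output is available as text in the format shown above. The function names are hypothetical, and the 15.625 ms tick length is an assumption (a common timer increment that reproduces the 0:08:42:41.109 wait time printed above).

```python
import re
from datetime import timedelta

ALPC_WAIT = re.compile(
    r"Waiting for reply to ALPC Message (\w+) : "
    r"queued at port (\w+) : owned by process (\w+)"
)

def parse_alpc_wait(line):
    """Return (message, port, owner_process), or None if the line differs."""
    m = ALPC_WAIT.search(line)
    return m.groups() if m else None

def ticks_to_duration(ticks, tick_us=15625):
    """Convert a tick count to a timedelta; tick_us is an assumed tick length."""
    return timedelta(microseconds=ticks * tick_us)

line = ("Waiting for reply to ALPC Message fffff8a003053d00 : "
        "queued at port fffffa800aa539e0 : owned by process fffffa8009d86630")
print(parse_alpc_wait(line))
print(ticks_to_duration(2007111))  # 8:42:41.109375
```

The first value of the tuple is the message address to pass to !alpc /m, and the long wait duration is a quick sign that the thread is genuinely stuck and not just briefly blocked.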
0: kd> !alpc /m fffff8a003053d00
Message @ fffff8a003053d00
MessageID : 0x0050 (80)
CallbackID : 0x034F (847)
SequenceNumber : 0x00000002 (2)
Type : LPC_REQUEST
DataLength : 0x0128 (296)
TotalLength : 0x0150 (336)
Canceled : No
Release : No
ReplyWaitReply : No
Continuation : Yes
OwnerPort : fffffa800aa5ce60 [ALPC_CLIENT_COMMUNICATION_PORT]
WaitingThread : fffffa800a696700
QueueType : ALPC_MSGQUEUE_PENDING
QueuePort : fffffa800aa539e0 [ALPC_CONNECTION_PORT]
QueuePortOwnerProcess : fffffa8009d86630 (lsm.exe)
ServerThread : fffffa800aa5d060
QuotaCharged : No
Step 9 – The important things in the above output are QueuePortOwnerProcess (lsm.exe) and ServerThread, which the next steps use.
Step 10 – So it seems that Csrss has an ALPC wait chain leading to lsm.exe.
Step 11 – To go further, we can look at the ServerThread in more detail to see which components are on its stack.
Step 12 – So run the command – !thread fffffa800aa5d060
Step 13 – In the output of this command I could see the whole stack, and by going through each component (remember, bottom-up) I found that a third-party component had made the last call and then just waited in a loop.
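The fields of the !alpc /m output can also be read programmatically, which helps when several hung sessions have to be compared. This is a Python sketch, assuming the output was saved as text with one "Name : value" pair per line as in the listing above; parse_alpc_message is a hypothetical helper.

```python
def parse_alpc_message(text):
    """Map each 'Name : value' line of saved !alpc /m output to a dict entry."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip():
            fields[name.strip()] = value.strip()
    return fields

sample = """\
Type : LPC_REQUEST
WaitingThread : fffffa800a696700
QueuePortOwnerProcess : fffffa8009d86630 (lsm.exe)
ServerThread : fffffa800aa5d060
"""
info = parse_alpc_message(sample)
# The server thread is the one to inspect next.
print(info["QueuePortOwnerProcess"], "->", info["ServerThread"])
```

If the same owner process and a similar server-thread stack show up for every hung session, that is good evidence of a single root cause.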
To summarize – System hang > orphan process > Csrss > ALPC wait > lsm.exe > 3rd-party component. I just asked the third party to check it, and the issue is pretty much resolved.
This is one technique, but you also need to look at locks (!cs -l) to confirm whether there is any deadlock. Here is a pretty good list of things to check – http://www.dumpanalysis.org/blog/index.php/2007/06/20/crash-dump-analysis-checklist/