By default, Bugzilla does not search the list of RESOLVED bugs.
You can force it to do so by putting the upper-case word ALL in front of your search query, e.g.: ALL tdelibs
We recommend searching for bugs this way, as you may discover that your bug has already been resolved and fixed in a later release.

Bug 1627

Summary: tdeioslave crash messages
Product: TDE Reporter: Darrell <darrella>
Component: tdelibsAssignee: Timothy Pearson <kb9vqf>
Status: RESOLVED FIXED    
Severity: normal CC: bugwatch, darrella, kb9vqf
Priority: P5    
Version: R14.0.0 [Trinity]   
Hardware: Other   
OS: Linux   
Compiler Version: TDE Version String:
Application Version: Application Name:

Description Darrell 2013-08-13 10:13:11 CDT
With a full clean rebuild that includes recent patches against tdeio slaves and session managerment, I now see the following messages in my xsession-errors log:

tdeioslave: ####### CRASH ###### protocol =  pid = 29858 signal = 11
/opt/trinity/lib/libtdeio.so.4(_ZN5TDEIO9SlaveBase15sigsegv_handlerEi+0xd0)[0xb742f350]
tdeio_media [tdeinit] media /dev/shm/tdesocket-root/tdelauncherM6C(_ZTI15TQPtrCollection+0x0)[0x8057564]

The pid changes but the remainder of the message string remains the same.

At the moment I'm not noticing any real crashes although I haven't tried testing extensively. As the patches are recent I'll add this bug report to the R14 etherpad.

One note, would be nice if the crash message included a process name in addition to the pid. When the message appears the process is already terminated and therefore the pid provides no useful information.
Comment 1 Darrell 2013-08-13 10:30:23 CDT
I don't know whether the following is related. I tried to logout. From a user's perspective the desktop simply froze. No T-menu, no popup menus, no response. Even my k-eyes applet eyeballs stopped moving.

The xsession-error log showed the following:

[ksmserver] 'Phase 2 complete' [10:16:41:273]
[ksmserver] 'Session discarded' [10:16:41:274]
[ksmserver] 'notifySlot' [10:16:41:274]
[ksmserver] 'startKilling' [10:16:41:274]
[ksmserver] 'completeKilling' [10:16:41:294]
[ksmserver] 'completeKilling' [10:16:41:294]
[ksmserver] 'completeKilling' [10:16:41:309]
WARNING: DCOPReply<>: cast to 'bool' error
[dcopserver] DCOP aborting while waiting for answer from 'kded'
[ksmserver] 'timeoutQuit' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client twin (10e3c6e462000137640384600000030220000)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client kicker (10e3c6e462000137640384700000030220003)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client kaccess (10e3c6e462000137640385200000030220005)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client notification-daemon-tde (10e3c6e462000137640385500000030220007)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client knotify (10e3c6e462000137640385500000030220008)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client kteatime (10e3c6e462000137640385500000030220009)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client kwikdisk (10e3c6e462000137640385600000030220010)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client akregator (10e3c6e462000137640385700000030220011)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client konqueror (10e3c6e462000137640385700000030220012)' [10:17:01:310]
[ksmserver] 'SmsDie timeout, client kmix (10e3c6e462000137640385700000030220013)' [10:17:01:310]
[ksmserver] 'SmsDie timeout, client klipper (10e3c6e462000137640385700000030220014)' [10:17:01:310]
[ksmserver] 'SmsDie timeout, client kttsd (10e3c6e462000137640386000000030220016)' [10:17:01:310]
[ksmserver] 'killWM' [10:17:01:310]
[ksmserver] 'completeKillingWM' [10:17:01:310]
[ksmserver] 'timeoutWMQuit' [10:17:06:313]
[ksmserver] 'killingCompleted' [10:17:06:313]

At which point I then used Ctrl-Alt-Backspace to terminate the logout.
Comment 2 Timothy Pearson 2013-08-13 12:28:38 CDT
Does reversing the change in tdeioslave/media/medianotifier/medianotifier.cpp from GIT hash 960cc839 (tdebase) resolve the crash?
Comment 3 Darrell 2013-08-13 13:48:52 CDT
I don't know. After reversing I'll need to watch the xsession-error log for a day or so.
Comment 4 Darrell 2013-08-13 17:58:07 CDT
Tim,

I see the crash message is created in tdelibs/tdeio/tdeio/slavebase.cpp:742:

snprintf(buffer, sizeof(buffer), "tdeioslave: ####### CRASH ###### protocol = %s pid = %d signal = %d\n", s_protocol, getpid(), sig);

Is there a function to grab the process name to pass as a second %s string parameter? Something similar to getpid(), like getprocessname(). Then I could create a test patch to provide a better clue as to the cause. Right now I don't always see the message, sometimes I see three of them. Without some kind of real-time monitor of the xsession-error log, I'm just guessing when the message appears.

Actually, if the test patch succeeded we should push to git permanently. :-)
Comment 5 Darrell 2013-08-14 19:22:59 CDT
Using a shell script I have been trying to capture the actual process. The script pops up a kdialog the moment a CRASH is found in the .xsession-errors log. The script also captures the current ps ax list.

When I receive the kdialog popup I look at the xsession-error log for the PID. Then I look into the process capture. I never find the process ID in the captured process list. I have seen this behavior several tims now. My conclusion is the process is started, executes the CRASH message, and is terminated, all in milliseconds. Always immediately after the full CRASH message, I'll see the following:

[tdeinit] PID 4826 terminated.

The PID is always the same as that in the CRASH message. That the two messages always appear back-to-back implies this CRASH event likely is happening quite fast.

I have not yet been able to notice exactly what I am doing at the moment the kdialog appears.
Comment 6 Darrell 2013-08-14 20:03:00 CDT
Hmm. Finally I captured a PID. In the xsession-errors log:

tdeioslave: ####### CRASH ###### protocol =  pid = 3578 signal = 11
/opt/trinity/lib/libtdeio.so.4(_ZN5TDEIO9SlaveBase15sigsegv_handlerEi+0xd0)[0xb746a350]
tdeio_media [tdeinit] media /dev/shm/tdesocket-tester/tdelauncher(_ZTI15TQPtrCollection+0x0)[0x8057564]
[tdeinit] PID 3578 terminated.


In the ps list:

3578 tdeio_media [tdeinit] media /dev/shm/tdesocket-tester/tdelauncher

I hope this helps.
Comment 7 Darrell 2013-08-14 23:32:28 CDT
I've been able to repeat three times the events shared in comment 6. Each time the CRASH PID and the PID in the captured process list are the same. Interestingly, I witnessed the CRASH without doing anything other than just starting the Trinity session. I will try to eliminate background apps in the hope if isolating the culprit.
Comment 8 Timothy Pearson 2013-08-14 23:37:12 CDT
I think I may need to modify the crash handler to generate a proper backtrace, then have you test again (similar to how we tracked down the kdesktop_lock crash issue a while back).
Comment 9 Darrell 2013-08-15 00:26:51 CDT
Ok. I disabled autostart apps, including kmix, and I still see the crash. The last several times I have been unable to see the PID in the process list. I don't know how the previous times got captured.
Comment 10 Darrell 2013-08-15 01:10:32 CDT
Possibly I see what is happening. When I run ps ax in konsole I notice a bunch of tdeio_* processes. All but one of those processes point to a socket file in $HOME/.trinity/socket-`uname -n`. The one tdeio_* file that does not point to that same socket file --- a socket file that does exist, is a process named tdeio_media. The process name is truncated because of length, but I can see enough of the process name to see that the referenced socket does not exist. Whatever tdeio_media process that was using that socket likely crashes because the socket no longer exists.
Comment 11 Darrell 2013-08-15 11:13:56 CDT
I was able to monitor the user's socket directory while starting Trinity. I do see a socket file that exists momentarily and disappears, replaced by a second socket file. The socket file is named similarly to the final socket file, such as tdelauncher??????.slave-socket. The momentary socket file is the one that is part of the tdeio_media process name.
Comment 12 Timothy Pearson 2013-08-15 15:35:31 CDT
Adding the debug handler has resulted in the following output:

tdeioslave: ####### CRASH ###### protocol =  pid = 13690 signal = 11
[tdeioslave] Obtained 10 stack frames.
[tdeioslave] slavebase.cpp:83	print_trace()
[tdeioslave] slavebase.cpp:776	TDEIO::SlaveBase::sigsegv_handler()
[tdeioslave] ??:0	killpg()
[tdeioslave] qapplication.cpp:1244	TQApplication::~TQApplication()
[tdeioslave] thumbnail.cpp:123	kdemain()
[tdeioslave] tdeinit.cpp:673	launch()
[tdeioslave] tdeinit.cpp:1242	handle_launcher_request()
[tdeioslave] tdeinit.cpp:1444	handle_requests()
[tdeioslave] tdeinit.cpp:1909	main()
[tdeioslave] libc-start.c:260	__libc_start_main()

Looks like something is going wrong with threading.
Comment 13 Timothy Pearson 2013-08-15 15:42:26 CDT
This might be fixed in GIT hashes 7c0bc39 (qt3) and 097eb25 (tqt3).  Can you please test and confirm?

Thanks!
Comment 14 Darrell 2013-08-15 16:32:44 CDT
>Looks like something is going wrong with threading.
Does that match my observations in comments 10 & 11 regrarding the momentary/missing socket file?

>Can you please test and confirm?
Only rebuild tqt3 or tdelibs/tdebase too?

I don't know of any test other than observation and observing the momentary appearance of the socket files as shared in comments 10 & 11. I suppose that the lone tdeio_media process I see should at least be using an existing socket file. :-)
Comment 15 Darrell 2013-08-15 21:50:20 CDT
I rebuilt binutils to include demangle.h so I can provide backtrace support. So despite the discussion for a demangle-less patch to tdelibs, I don't need that patch for this bug report. :-)

After rebuilding tqt3/tdelibs/tdebase --- and binutils:

When starting Trinity, a momentary socket file is still created and then disappears. I can observe that by watching the user's $TDEHOME while Trinity is loading. After Trinity has loaded there is a final single socket file and all tdeio* processes have that final socket file in the process name, including the single tdeio_media process I see.

My CRASH detection script again notified me that a CRASH message occurred in my xsession-errors log. The good news is we now know the tdeio_media process is the culprit. I can find that PID in the process list before the CRASH occurs and that PID always matches the CRASH PID.

Unlike previously, after I receive my script notification that a CRASH message occurred, the tdeio_media process now terminates. I don't know whether that is good or bad.

I now can repeat the detection process: I autostart my script when starting Trinity, wait for several minutes --- the wait is always only several minutes, then confirm the CRASH message is caused by the original tdeio_media process, which now terminates after the CRASH message.

I don't think anything I do triggers the CRASH message. Several times I have let the system just sit idle after starting Trinity and I still get notified. I think this CRASH is something happening deeper.

That said, there is no backtrace being created. I don't see any in the xsession-errors log and the kcrash dialog never appears. As I have demangle.h installed, tdelibs built without error, and I'm not patching anything extra, I presumed a kcrash dialog would now appear automatically.
Comment 16 Darrell 2013-08-16 11:18:07 CDT
The plot thickens. I added a timer to my detection script. The CRASH message always occurs at about 6 minutes 30 seconds, varying from about 6:29 to about 6:38. This timer is within the script and not the exact moment Trinity starts loading, but the difference probably is no more than a second or two.

I disabled all autostart apps including debusnotification. Same results. As the PID in question is always a process named tdeio_media, I disabled the arts sound system from starting. Same results.

Still no bactrace being produced.
Comment 17 Darrell 2013-08-16 11:41:22 CDT
Here is my detection script. I place the script in a desktop file in the Trinity Autostart directory. After Trinity starts I open konsole and run the following:

ps ax | grep tdeio_

That reveals the PID of the single tdeio_media process running on my systems. I leave konsole open and then about 6:30 minutes later the script kdialog appears, which contains the timer and PID. I then run the same konsole command and notice the tdeio_media process is gone.

=============================
#!/bin/sh

LOG_FILE=$HOME/.xsession-errors
CRASH_ERROR=""

START_TIME="`date +%s`"

while [ -z "$CRASH_ERROR" ]; do
  CRASH_ERROR="`grep CRASH $LOG_FILE`"
  sleep .05
done

END_TIME="`date +%s`"
TOTAL_TIME=$(($END_TIME - $START_TIME))
TOTAL_TIME="$(($TOTAL_TIME / 60 )) minutes $(($TOTAL_TIME % 60 )) seconds"

CRASH_PID="`grep CRASH $LOG_FILE | awk -F 'pid = ' '{print $2}' | sed 's| signal = 11||'`"

kdialog --display $DISPLAY --title "tdeioslave Crash Report" --msgbox "A tdeoslave crash occured. PID: $CRASH_PID. About $TOTAL_TIME."
=============================
Comment 18 Darrell 2013-08-16 12:43:27 CDT
I just saw the latest tqt3 patch related to this bug report and rebuilt.

My detection script no longer detects any CRASH messages in my xsession-errors log because, thus far after rebuilding tqt3, none occurs.

First: Good job! :-)

Two questions:

1) When should I see a backtrace? The code is there but I never saw a backtrace after rebuilding tdelibs. Now that tqt3 is rebuilt and I don't see any CRASH messages, I don't know that I will ever see a backtrace.

2) Have the recent patches resolved the CRASH root cause or do the patches quietly "ignore" the CRASHes so they are not relayed to stdout/stderr?
Comment 19 Timothy Pearson 2013-08-16 14:01:29 CDT
(In reply to comment #18)
> I just saw the latest tqt3 patch related to this bug report and rebuilt.
> 
> My detection script no longer detects any CRASH messages in my xsession-errors
> log because, thus far after rebuilding tqt3, none occurs.
> 
> First: Good job! :-)
> 
> Two questions:
> 
> 1) When should I see a backtrace? The code is there but I never saw a backtrace
> after rebuilding tdelibs. Now that tqt3 is rebuilt and I don't see any CRASH
> messages, I don't know that I will ever see a backtrace.

You should see a backtrace only when you hack the tdeioslavebase.cpp file to #undef NDEBUG.  Personally I'd like to see this fixed so that backtraces are enabled at all times in the crash handler, as it is not helpful to either the users or developers to know that a tdeioslave has crashed with no additional information.

> 2) Have the recent patches resolved the CRASH root cause or do the patches
> quietly "ignore" the CRASHes so they are not relayed to stdout/stderr?

The root cause has been fixed.  I don't hide things like this; they would just come back later with an even harder to trace failure. :-)

Tim
Comment 20 Darrell 2013-08-16 14:09:32 CDT
Do I have to explicitly declare -DWITH_NDEBUG=ON in the build script configuration options?
Comment 21 Darrell 2013-08-16 19:45:12 CDT
Never mind. I think I see how this works. NDEBUG gets declared outside the tdelibs configuration options. The patch looks for ifndef NDEBUG, which normally should be true. Then in the tdelibs configuration checks, if execinfo.h exists the environment variable HAVE_BACKTRACE is enabled.

If that explanation is correct, then I remain stumped why I never saw a backtrace being generated.
Comment 22 Timothy Pearson 2013-08-16 22:35:42 CDT
(In reply to comment #21)
> Never mind. I think I see how this works. NDEBUG gets declared outside the
> tdelibs configuration options. The patch looks for ifndef NDEBUG, which
> normally should be true. Then in the tdelibs configuration checks, if
> execinfo.h exists the environment variable HAVE_BACKTRACE is enabled.
> 
> If that explanation is correct, then I remain stumped why I never saw a
> backtrace being generated.

Because NDEBUG is true.  NDEBUG stands for Not Debug AFAICT, and the backtrace generation code is placed within an #ifndef NDEBUG block (i.e. if NDEBUG is NOT 1, then the code is included).  As I mentioned earlier, I personally think the NDBUG compile time options should be removed from around the backtrace generation code, so that useful backtraces can be posted by normal users if needed.
Comment 23 Timothy Pearson 2013-09-09 00:45:20 CDT
Is this issue resolved?
Comment 24 Darrell 2013-09-09 11:30:57 CDT
The original report: yes, comment 18.

In the discussion you mentioned a problem with how NDEBUG is handled --- I never saw any backtraces while helping with problem despite the explicit introduction of backtrace support code.
Comment 25 Darrell 2013-10-03 13:31:05 CDT
Tim,

We can close this report as resolved, but we should open a new report, or enhancement request, addressing your concerns about NDEBUG. As you are familiar with the issue, would you open that report so the problem description is correct? I'll then close this report as resolved.

I'm tagging this report as resolved in the R14 etherpad road map.
Comment 26 Darrell 2013-11-12 17:12:30 CST
Bug 1710 opened to address the concerns about NDEBUG and kcrash.