| Summary: | tdeioslave crash messages | | |
|---|---|---|---|
| Product: | TDE | Reporter: | Darrell <darrella> |
| Component: | tdelibs | Assignee: | Timothy Pearson <kb9vqf> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | bugwatch, darrella, kb9vqf |
| Priority: | P5 | | |
| Version: | R14.0.0 [Trinity] | | |
| Hardware: | Other | | |
| OS: | Linux | | |
| Compiler Version: | | TDE Version String: | |
| Application Version: | | Application Name: | |
Description
Darrell
2013-08-13 10:13:11 CDT
I don't know whether the following is related. I tried to log out. From a user's perspective the desktop simply froze: no T-menu, no popup menus, no response. Even my k-eyes applet eyeballs stopped moving. The .xsession-errors log showed the following:

[ksmserver] 'Phase 2 complete' [10:16:41:273]
[ksmserver] 'Session discarded' [10:16:41:274]
[ksmserver] 'notifySlot' [10:16:41:274]
[ksmserver] 'startKilling' [10:16:41:274]
[ksmserver] 'completeKilling' [10:16:41:294]
[ksmserver] 'completeKilling' [10:16:41:294]
[ksmserver] 'completeKilling' [10:16:41:309]
WARNING: DCOPReply<>: cast to 'bool' error
[dcopserver] DCOP aborting while waiting for answer from 'kded'
[ksmserver] 'timeoutQuit' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client twin (10e3c6e462000137640384600000030220000)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client kicker (10e3c6e462000137640384700000030220003)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client kaccess (10e3c6e462000137640385200000030220005)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client notification-daemon-tde (10e3c6e462000137640385500000030220007)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client knotify (10e3c6e462000137640385500000030220008)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client kteatime (10e3c6e462000137640385500000030220009)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client kwikdisk (10e3c6e462000137640385600000030220010)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client akregator (10e3c6e462000137640385700000030220011)' [10:17:01:309]
[ksmserver] 'SmsDie timeout, client konqueror (10e3c6e462000137640385700000030220012)' [10:17:01:310]
[ksmserver] 'SmsDie timeout, client kmix (10e3c6e462000137640385700000030220013)' [10:17:01:310]
[ksmserver] 'SmsDie timeout, client klipper (10e3c6e462000137640385700000030220014)' [10:17:01:310]
[ksmserver] 'SmsDie timeout, client kttsd (10e3c6e462000137640386000000030220016)' [10:17:01:310]
[ksmserver] 'killWM' [10:17:01:310]
[ksmserver] 'completeKillingWM' [10:17:01:310]
[ksmserver] 'timeoutWMQuit' [10:17:06:313]
[ksmserver] 'killingCompleted' [10:17:06:313]

At that point I used Ctrl-Alt-Backspace to terminate the logout.

Does reversing the change in tdeioslave/media/medianotifier/medianotifier.cpp from GIT hash 960cc839 (tdebase) resolve the crash? I don't know. After reversing it I'll need to watch the .xsession-errors log for a day or so.

Tim, I see the crash message is created in tdelibs/tdeio/tdeio/slavebase.cpp:742:

snprintf(buffer, sizeof(buffer), "tdeioslave: ####### CRASH ###### protocol = %s pid = %d signal = %d\n", s_protocol, getpid(), sig);

Is there a function to grab the process name to pass as a second %s string parameter? Something similar to getpid(), like getprocessname(). Then I could create a test patch to provide a better clue as to the cause. Right now I don't always see the message; sometimes I see three of them. Without some kind of real-time monitor of the .xsession-errors log, I'm just guessing when the message appears. Actually, if the test patch succeeded we should push it to git permanently. :-)
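Editor's note on the process-name question above: as far as I know there is no getprocessname() in glibc, but on a glibc/Linux target the short program name is already available. The sketch below is hypothetical, not existing TDE code; guess_process_name() is an invented helper, program_invocation_short_name is a glibc extension declared in <errno.h> under _GNU_SOURCE, and the /proc/self/comm fallback is Linux-only and not async-signal-safe, so inside a real SIGSEGV handler the glibc global would be the safer choice.
=============================
// Hypothetical helper -- not part of the existing TDE sources.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>   // program_invocation_short_name (glibc extension)
#include <stdio.h>
#include <string.h>

static const char* guess_process_name()
{
    // glibc keeps the short program name in a global; reading it is cheap
    // and safe even from a signal handler.
    if (program_invocation_short_name && *program_invocation_short_name)
        return program_invocation_short_name;

    // Fallback: ask the kernel for the command name (Linux only, uses stdio).
    static char comm[64] = "unknown";
    FILE* f = fopen("/proc/self/comm", "r");
    if (f) {
        if (fgets(comm, sizeof(comm), f))
            comm[strcspn(comm, "\n")] = '\0';   // strip trailing newline
        fclose(f);
    }
    return comm;
}

int main()
{
    printf("process name: %s\n", guess_process_name());
    return 0;
}
=============================
The crash message could then carry an extra %s (for example "name = %s" filled from such a helper), which would have pointed at tdeio_media much earlier.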
Using a shell script I have been trying to capture the actual process. The script pops up a kdialog the moment a CRASH is found in the .xsession-errors log. The script also captures the current ps ax list. When I receive the kdialog popup I look at the .xsession-errors log for the PID, then look into the process capture. I never find the process ID in the captured process list. I have seen this behavior several times now.

My conclusion is that the process is started, executes the CRASH message, and is terminated, all in milliseconds. Always immediately after the full CRASH message, I'll see the following:

[tdeinit] PID 4826 terminated.

The PID is always the same as that in the CRASH message. That the two messages always appear back-to-back implies this CRASH event is likely happening quite fast. I have not yet been able to notice exactly what I am doing at the moment the kdialog appears. Hmm.

Finally I captured a PID. In the xsession-errors log:

tdeioslave: ####### CRASH ###### protocol = pid = 3578 signal = 11
/opt/trinity/lib/libtdeio.so.4(_ZN5TDEIO9SlaveBase15sigsegv_handlerEi+0xd0)[0xb746a350]
tdeio_media [tdeinit] media /dev/shm/tdesocket-tester/tdelauncher(_ZTI15TQPtrCollection+0x0)[0x8057564]
[tdeinit] PID 3578 terminated.

In the ps list:

3578 tdeio_media [tdeinit] media /dev/shm/tdesocket-tester/tdelauncher

I hope this helps.

I've been able to repeat three times the events shared in comment 6. Each time the CRASH PID and the PID in the captured process list are the same. Interestingly, I witnessed the CRASH without doing anything other than just starting the Trinity session. I will try to eliminate background apps in the hope of isolating the culprit.

I think I may need to modify the crash handler to generate a proper backtrace, then have you test again (similar to how we tracked down the kdesktop_lock crash issue a while back).

Ok. I disabled autostart apps, including kmix, and I still see the crash. The last several times I have been unable to see the PID in the process list. I don't know how the previous times got captured.

Possibly I see what is happening. When I run ps ax in konsole I notice a bunch of tdeio_* processes. All but one of those processes point to a socket file in $HOME/.trinity/socket-`uname -n`, a socket file that does exist. The one tdeio_* process that does not point to that same socket file is named tdeio_media. The process name is truncated because of its length, but I can see enough of it to tell that the referenced socket does not exist. Whatever tdeio_media process was using that socket likely crashes because the socket no longer exists.

I was able to monitor the user's socket directory while starting Trinity. I do see a socket file that exists momentarily and disappears, replaced by a second socket file. The momentary socket file is named similarly to the final socket file, such as tdelauncher??????.slave-socket, and it is the one that appears in the tdeio_media process name.

Adding the debug handler has resulted in the following output:

tdeioslave: ####### CRASH ###### protocol = pid = 13690 signal = 11
[tdeioslave] Obtained 10 stack frames.
[tdeioslave] slavebase.cpp:83 print_trace()
[tdeioslave] slavebase.cpp:776 TDEIO::SlaveBase::sigsegv_handler()
[tdeioslave] ??:0 killpg()
[tdeioslave] qapplication.cpp:1244 TQApplication::~TQApplication()
[tdeioslave] thumbnail.cpp:123 kdemain()
[tdeioslave] tdeinit.cpp:673 launch()
[tdeioslave] tdeinit.cpp:1242 handle_launcher_request()
[tdeioslave] tdeinit.cpp:1444 handle_requests()
[tdeioslave] tdeinit.cpp:1909 main()
[tdeioslave] libc-start.c:260 __libc_start_main()
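Editor's note on the debug-handler output above: the "Obtained N stack frames." line matches the classic pattern of the glibc backtrace facility in <execinfo.h>; resolving frames to file:line and demangled names as shown would need additional symbol lookup (the demangle.h discussion further down), which is beyond this sketch. The actual print_trace() in tdelibs may differ; this is only a minimal standalone illustration that assumes a glibc target, ideally linked with -rdynamic so addresses resolve to symbol names.
=============================
// Illustrative sketch only -- not the actual tdelibs print_trace().
// backtrace() captures raw return addresses; backtrace_symbols_fd()
// writes one symbolized line per frame directly to a file descriptor,
// avoiding malloc(), which matters inside a SIGSEGV handler.
#include <execinfo.h>
#include <stdio.h>
#include <unistd.h>

static void print_trace(void)
{
    void* frames[50];
    int count = backtrace(frames, 50);
    fprintf(stderr, "[tdeioslave] Obtained %d stack frames.\n", count);
    backtrace_symbols_fd(frames, count, STDERR_FILENO);
}

int main()
{
    print_trace();   // prints the trace of this call chain to stderr
    return 0;
}
=============================
Symbol names come out mangled (as in the _ZN5TDEIO9SlaveBase15sigsegv_handlerEi frame in the earlier log) unless they are run through a demangler.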
Looks like something is going wrong with threading. This might be fixed in GIT hashes 7c0bc39 (qt3) and 097eb25 (tqt3). Can you please test and confirm? Thanks!

> Looks like something is going wrong with threading.

Does that match my observations in comments 10 & 11 regarding the momentary/missing socket file?

> Can you please test and confirm?

Only rebuild tqt3, or tdelibs/tdebase too? I don't know of any test other than observation and observing the momentary appearance of the socket files as shared in comments 10 & 11. I suppose that the lone tdeio_media process I see should at least be using an existing socket file. :-)

I rebuilt binutils to include demangle.h so I can provide backtrace support. So despite the discussion of a demangle-less patch to tdelibs, I don't need that patch for this bug report. :-)

After rebuilding tqt3/tdelibs/tdebase --- and binutils:

When starting Trinity, a momentary socket file is still created and then disappears. I can observe that by watching the user's $TDEHOME while Trinity is loading. After Trinity has loaded there is a final single socket file, and all tdeio* processes have that final socket file in the process name, including the single tdeio_media process I see.

My CRASH detection script again notified me that a CRASH message occurred in my xsession-errors log. The good news is we now know the tdeio_media process is the culprit. I can find that PID in the process list before the CRASH occurs, and that PID always matches the CRASH PID. Unlike previously, after I receive my script notification that a CRASH message occurred, the tdeio_media process now terminates. I don't know whether that is good or bad.

I now can repeat the detection process: I autostart my script when starting Trinity, wait for several minutes --- the wait is always only several minutes --- then confirm the CRASH message is caused by the original tdeio_media process, which now terminates after the CRASH message. I don't think anything I do triggers the CRASH message. Several times I have let the system just sit idle after starting Trinity and I still get notified. I think this CRASH is something happening deeper.

That said, there is no backtrace being created. I don't see any in the xsession-errors log and the kcrash dialog never appears. As I have demangle.h installed, tdelibs built without error, and I'm not patching anything extra, I presumed a kcrash dialog would now appear automatically.

The plot thickens. I added a timer to my detection script. The CRASH message always occurs at about 6 minutes 30 seconds, varying from about 6:29 to about 6:38. This timer is within the script and not the exact moment Trinity starts loading, but the difference probably is no more than a second or two. I disabled all autostart apps including debusnotification. Same results. As the PID in question is always a process named tdeio_media, I disabled the arts sound system from starting. Same results. Still no backtrace being produced.

Here is my detection script. I place the script in a desktop file in the Trinity Autostart directory. After Trinity starts I open konsole and run the following:
ps ax | grep tdeio_
That reveals the PID of the single tdeio_media process running on my systems. I leave konsole open, and about 6 minutes 30 seconds later the script's kdialog appears, which contains the timer and PID. I then run the same konsole command and notice the tdeio_media process is gone.
=============================
#!/bin/sh
# Watch the X session error log and pop up a kdialog as soon as a
# tdeioslave CRASH message appears, reporting the PID and elapsed time.
LOG_FILE=$HOME/.xsession-errors
CRASH_ERROR=""
START_TIME="`date +%s`"
# Poll the log every 50 ms until a CRASH line shows up.
while [ -z "$CRASH_ERROR" ]; do
  CRASH_ERROR="`grep CRASH $LOG_FILE`"
  sleep .05
done
END_TIME="`date +%s`"
TOTAL_TIME=$(($END_TIME - $START_TIME))
TOTAL_TIME="$(($TOTAL_TIME / 60 )) minutes $(($TOTAL_TIME % 60 )) seconds"
# Extract the PID from the CRASH line.
CRASH_PID="`grep CRASH $LOG_FILE | awk -F 'pid = ' '{print $2}' | sed 's| signal = 11||'`"
kdialog --display $DISPLAY --title "tdeioslave Crash Report" --msgbox "A tdeioslave crash occurred. PID: $CRASH_PID. About $TOTAL_TIME."
=============================
I just saw the latest tqt3 patch related to this bug report and rebuilt. My detection script no longer detects any CRASH messages in my xsession-errors log because, thus far after rebuilding tqt3, none occurs.

First: Good job! :-)

Two questions:

1) When should I see a backtrace? The code is there but I never saw a backtrace after rebuilding tdelibs. Now that tqt3 is rebuilt and I don't see any CRASH messages, I don't know that I will ever see a backtrace.

2) Have the recent patches resolved the CRASH root cause or do the patches quietly "ignore" the CRASHes so they are not relayed to stdout/stderr?

(In reply to comment #18)
> I just saw the latest tqt3 patch related to this bug report and rebuilt.
>
> My detection script no longer detects any CRASH messages in my xsession-errors
> log because, thus far after rebuilding tqt3, none occurs.
>
> First: Good job! :-)
>
> Two questions:
>
> 1) When should I see a backtrace? The code is there but I never saw a backtrace
> after rebuilding tdelibs. Now that tqt3 is rebuilt and I don't see any CRASH
> messages, I don't know that I will ever see a backtrace.

You should see a backtrace only when you hack the tdeioslavebase.cpp file to #undef NDEBUG. Personally I'd like to see this fixed so that backtraces are enabled at all times in the crash handler, as it is not helpful to either the users or developers to know that a tdeioslave has crashed with no additional information.

> 2) Have the recent patches resolved the CRASH root cause or do the patches
> quietly "ignore" the CRASHes so they are not relayed to stdout/stderr?

The root cause has been fixed. I don't hide things like this; they would just come back later with an even harder to trace failure. :-)

Tim

Do I have to explicitly declare -DWITH_NDEBUG=ON in the build script configuration options?

Never mind. I think I see how this works. NDEBUG gets declared outside the tdelibs configuration options. The patch looks for ifndef NDEBUG, which normally should be true. Then in the tdelibs configuration checks, if execinfo.h exists the environment variable HAVE_BACKTRACE is enabled.

If that explanation is correct, then I remain stumped why I never saw a backtrace being generated.

(In reply to comment #21)
> Never mind. I think I see how this works. NDEBUG gets declared outside the
> tdelibs configuration options. The patch looks for ifndef NDEBUG, which
> normally should be true. Then in the tdelibs configuration checks, if
> execinfo.h exists the environment variable HAVE_BACKTRACE is enabled.
>
> If that explanation is correct, then I remain stumped why I never saw a
> backtrace being generated.

Because NDEBUG is true. NDEBUG stands for Not Debug AFAICT, and the backtrace generation code is placed within an #ifndef NDEBUG block (i.e., if NDEBUG is not defined, then the code is included). As I mentioned earlier, I personally think the NDEBUG compile-time options should be removed from around the backtrace generation code, so that useful backtraces can be posted by normal users if needed.
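Editor's note: to make the NDEBUG gating concrete, here is a standalone illustration. It is not the actual slavebase.cpp code; HAVE_BACKTRACE here merely stands in for the configure-time check for execinfo.h. Release builds typically pass -DNDEBUG, which compiles the trace block out entirely, which is consistent with no backtrace ever appearing.
=============================
// Standalone illustration of the compile-time gating discussed above;
// the real slavebase.cpp is arranged differently.
#include <stdio.h>

static void crash_handler(int sig)
{
    fprintf(stderr, "tdeioslave: ####### CRASH ###### signal = %d\n", sig);
#if !defined(NDEBUG) && defined(HAVE_BACKTRACE)
    // Backtrace generation would live here; with -DNDEBUG this block is
    // removed by the preprocessor, so only the bare CRASH line remains.
    fprintf(stderr, "[tdeioslave] (backtrace would be printed here)\n");
#endif
}

int main(void)
{
    crash_handler(11);   // simulate the SIGSEGV path
    return 0;
}
=============================
Compiling with "g++ -DHAVE_BACKTRACE test.cpp" prints both lines; adding -DNDEBUG drops the second one, matching the behavior described in the thread.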
Is this issue resolved?

The original report: yes, comment 18. In the discussion you mentioned a problem with how NDEBUG is handled --- I never saw any backtraces while helping with this problem despite the explicit introduction of backtrace support code.

Tim, we can close this report as resolved, but we should open a new report, or enhancement request, addressing your concerns about NDEBUG. As you are familiar with the issue, would you open that report so the problem description is correct? I'll then close this report as resolved.

I'm tagging this report as resolved in the R14 etherpad road map. Bug 1710 has been opened to address the concerns about NDEBUG and kcrash.