Very Strange and Probably Obscure Problem

eliwap
Level 2
Posts: 17
Joined: Thu Jun 25, 2009 3:44 pm

Very Strange and Probably Obscure Problem

Post by eliwap »

Hi...

I really hope that somebody can help me brainstorm this problem. Please let me describe my environment.

Environment
---------------
Fedora 16 x86_64(fully up-to-date)
Wine 1.5.1 32bit and noarch binaries only
Nvidia GPU - GTS 450
Nvidia Driver - 295.40 (including 32bit libraries)

I've installed Perfectworld International under Wine. Everything was running fine until the most recent major update, called Descent. It took a while to figure out the requirements, and once that was done, I made a clean install of the game, including a fresh Wine prefix. I configured the registry as required.

Now for the longest time I was unable to run the game at all. When I tried to do it manually, I constantly got the following error message:

winehq err:seh:raise_exception Unhandled exception code c0000005 flags 0 addr

I stumbled onto a workaround: uninstalling and reinstalling xorg-x11-drv-nvidia-libs.i686. The game then works for a while without a problem, but eventually it stops working until I reinstall the library again.

Two things that are very funny about this. First thing is that recently I placed the hardware into a computer with much more capable hardware. This problem began in the older hardware and persisted in the newer hardware.

The second thing that's funny about this is that I can run the game without a problem on a different machine with the exact same OS and software configuration (see above).

I really do hope that somebody out there can help me to try to figure out what's going on. Is it a wine problem, glibc problem, kernel problem, hardware problem........

Please help.

Thanks

Eli Wapniarski
dimesio
Moderator
Posts: 13207
Joined: Tue Mar 25, 2008 10:30 pm

Re: Very Strange and Probably Obscure Problem

Post by dimesio »

eliwap wrote: Two things that are very funny about this. First thing is that recently I placed the hardware into a computer with much more capable hardware. This problem began in the older hardware and persisted in the newer hardware.
Exactly what hardware did you move?
The second thing that's funny about this is that I can run the game without a problem on a different machine with the exact same OS and software configuration (see above).
And what is the difference in hardware?

This sounds like a hardware problem. Whether it's a particular piece of hardware that's failing, or a problem with a particular model of something, I can't tell from what you've posted.
eliwap
Level 2
Posts: 17
Joined: Thu Jun 25, 2009 3:44 pm

Post by eliwap »

Thanks for responding.

The only hardware that moved were my hard drives and my video card. As I said, the problem occurred in both machines.

And I have a hard time believing that it's a hardware problem, because everything else seems to work just fine. The computer doesn't crash. This Wine issue seems to be the only trouble that I'm having.

Is there a way to check the physical health of the nvidia card?

Eli
dimesio
Moderator
Posts: 13207
Joined: Tue Mar 25, 2008 10:30 pm

Post by dimesio »

eliwap wrote: The only hardware that moved were my hard drives and my video card. As I said, the problem occurred in both machines.

And I have a hard time believing that it's a hardware problem, because everything else seems to work just fine. The computer doesn't crash. This Wine issue seems to be the only trouble that I'm having.

Is there a way to check the physical health of the nvidia card?
I'm more inclined to suspect the hard drive based on your description of the problem. The fact that reinstalling a file temporarily fixes the problem would be consistent with a hard drive that is developing more and more bad sectors over time. Both Western Digital and Seagate have downloadable utilities that can be burned to a bootable cd to check the health of their hard drives. I don't know of any utilities to check the health of a graphics card; can you swap the card with your other machine? If you do that and the problem follows the card, then you'll know it's the card.
Martin Gregorie

Very Strange and Probably Obscure Problem

Post by Martin Gregorie »

On Tue, 2012-04-17 at 12:33 -0500, dimesio wrote:
I'm more inclined to suspect the hard drive based on your description
of the problem. The fact that reinstalling a file temporarily fixes
the problem would be consistent with a hard drive that is developing
more and more bad sectors over time. Both Western Digital and Seagate
have downloadable utilities that can be burned to a bootable cd to
check the health of their hard drives.
There are also a few standard ways to check a disk with built-in Linux
utilities: smartd, fsck and badblocks. These should work with disks from
any manufacturer.

If the OP hasn't installed smartd, he should consider doing so as it's a
non-destructive way to check disk status and get warnings of impending
problems. Once it is installed and running, giving the command:

killall -USR1 smartd

will cause smartd to do an immediate disk scan.

If the disk that's suspected of having problems can be temporarily
mounted on a PC as a second drive (an external USB disk dock is useful
for this if the OP has or can borrow one) it can be checked more
thoroughly by:

- running fsck against the partition(s) containing /home and /usr
with the -f and -n options set. The -f option forces a full scan and
the -n option tells fsck to check the partition and report problems
without attempting to repair them.

- running badblocks against the disk device with the -n option

Both programs should be used with the disk attached to the PC but
without any mounted partitions on the disk, i.e. if the partition(s) get
automounted they *MUST* be unmounted before running either fsck or
badblocks.
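The checks above can be sketched as a dry run that only prints the commands, so nothing touches the disk until the reader has looked them over. The function names, the optional mounts-table argument, and the device names in the usage note are assumptions made for illustration; the fsck and badblocks options are the ones named in the post.

```shell
#!/bin/sh
# Dry run: print the checks for one partition and its disk, run nothing.
# Both function names are hypothetical helpers, not part of any tool.

is_mounted() {
    # $1 = device node, $2 = mounts table (defaults to /proc/mounts)
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

plan_checks() {
    part=$1                       # partition to hand to fsck
    disk=$2                       # whole-disk device to hand to badblocks
    mounts=${3:-/proc/mounts}
    if is_mounted "$part" "$mounts"; then
        echo "refusing: $part is mounted, unmount it first" >&2
        return 1
    fi
    echo "fsck -f -n $part"       # full check, report only, no repairs
    echo "badblocks -n -v $disk"  # non-destructive read-write scan
}
```

For example, plan_checks /dev/sdc1 /dev/sdc prints the two commands if /dev/sdc1 is not listed in /proc/mounts. The mount test only looks at the one partition given, so any other automounted partitions on the same disk still need checking by hand.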


Martin
eliwap
Level 2
Posts: 17
Joined: Thu Jun 25, 2009 3:44 pm

Post by eliwap »

k... Thanks for that... I will do these things over the next couple of days.

They are indeed a very good place to start.

Eli
eliwap
Level 2
Posts: 17
Joined: Thu Jun 25, 2009 3:44 pm

Post by eliwap »

k... smartd did indeed report a problem. Drat.

So now currently I am running:

badblocks -v /dev/mapper/vg_computer-lv_root > bad-blocks

I intend to then run:

fsck -l bad-blocks /dev/mapper/vg_computer-lv_root > bad-blocks

If I am making a mistake, or there is a better way to do this, then I would greatly appreciate any advice.

Thank you.
Martin Gregorie

Very Strange and Probably Obscure Problem

Post by Martin Gregorie »

On Wed, 2012-04-18 at 12:03 -0500, eliwap wrote:
k... smartd did indeed report a problem. Drat.
...but it did find it! smartd is light-weight in its resource use, at
least I've never noticed any slow-downs I could blame on it, so IMO
installing and enabling it on all Linux boxes is a no-brainer. I have it
set up to generate and mail me reports on a weekly basis.
So now currently I am running:

badblocks -v /dev/mapper/vg_computer-lv_root > bad-blocks

I intend to then run:

fsck -l bad-blocks /dev/mapper/vg_computer-lv_root > bad-blocks
Shouldn't that be:

fsck -l bad-blocks /dev/mapper/vg_computer-lv_root

I think your shell redirection would clear the badblocks list in
preparation for writing to it, so you'd empty the file before fsck could
apply it.
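The truncation described above is easy to reproduce without going near a real disk. In this sketch sort stands in for any command that reads the list file (fsck itself is not run), and the block numbers and the function name are made up for the demonstration.

```shell
#!/bin/sh
# Show that "cmd FILE > FILE" empties FILE before cmd can read it.

demo_truncation() {
    list=$(mktemp)
    printf '100\n200\n' > "$list"   # made-up bad-block numbers

    # The shell opens and truncates "$list" for writing first,
    # then starts sort, which finds nothing left to read.
    sort "$list" > "$list"

    if [ -s "$list" ]; then
        echo "list kept its contents"
    else
        echo "list is now empty"
    fi
    rm -f "$list"
}

demo_truncation
```

Sending the output to a different file name, or dropping the redirection as suggested, leaves the list intact.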

If I am making a mistake, or there is a better way to do this, then I
would greatly appreciate any advice.

I might have run fsck with the -c option, which gets passed through to
e2fsck. This combines both operations - it runs badblocks internally,
applies anything it finds to the badblocks list and then (presumably)
checks the filing structure.


Martin
jjmckenzie
Moderator
Posts: 1153
Joined: Wed Apr 27, 2011 11:01 pm

Very Strange and Probably Obscure Problem

Post by jjmckenzie »

On Wed, Apr 18, 2012 at 10:03 AM, eliwap wrote:
k... smartd did indeed report a problem. Drat.
Time to buy a new hard drive. With 'modern' equipment the failure
will be quick and very painful if you try to recover from it.

(I've learned from experience, backups are a necessary evil as well.)

James
eliwap
Level 2
Posts: 17
Joined: Thu Jun 25, 2009 3:44 pm

Post by eliwap »

backups have been done. As a matter of fact the problem is having to backup the backups lol which have also been done.
eliwap
Level 2
Posts: 17
Joined: Thu Jun 25, 2009 3:44 pm

Re: Very Strange and Probably Obscure Problem

Post by eliwap »

Martin Gregorie wrote: Shouldn't that be:

fsck -l bad-blocks /dev/mapper/vg_computer-lv_root

I think your shell redirection would clear the badblocks list in
preparation for writing to it, so you'd empty the file before fsck could
apply it.

If I am making a mistake, or there is a better way to do this, then I
would greatly appreciate any advice.

I might have run fsck with the -c option, which gets passed through to
e2fsck. This combines both operations - it runs badblocks internally,
applies anything it finds to the badblocks list and then (presumably)
checks the filing structure.


Martin

That was a typo. OK... thanks for the tip about -c; I will definitely do that. I interrupted the badblocks run anyway because I needed to back up stuff.
eliwap
Level 2
Posts: 17
Joined: Thu Jun 25, 2009 3:44 pm

Post by eliwap »

OK... I rebooted the computer and it came up. Yay!!!.

I ran the game and it started without a problem. Yay!!!

Thanks, Martin and dimesio, for your advice. Stay tuned; I will let you guys know after a few updates if things are still working the way they're supposed to.
eliwap
Level 2
Posts: 17
Joined: Thu Jun 25, 2009 3:44 pm

Post by eliwap »

Drat

I just ran yum update, rebooted, and had to reinstall the nvidia 32bit libraries again.

Without a doubt, the maintenance on my hard drives was good for my computer, but it did not solve the problem.

So, I'm asking.... next???

I really do appreciate all the help.


Eli
eliwap
Level 2
Posts: 17
Joined: Thu Jun 25, 2009 3:44 pm

Post by eliwap »

Bummer. I probably do have to replace the hard drive

Checked smartd again. The service was disabled even though I explicitly enabled it. And once I brought it online again, I immediately got

WARNING: Your hard drive is failing
Device: /dev/sdb [SAT], 2 Offline uncorrectable sectors

That was the same error I got before I ran the maintenance procedure.

Time to burn a clonezilla cd

arrrgh.

Eli
eliwap
Level 2
Level 2
Posts: 17
Joined: Thu Jun 25, 2009 3:44 pm

Post by eliwap »

Oh.... I was thinking. Could a problem like this arise from a corrupted journal?

Would it be wise to try to disable the journal (assuming that it would be removed) as per http://fenidik.blogspot.com/2010/03/ext ... urnal.html and then enable the journal (assuming that it would be recreated)?

Is there an intermediary step necessary to delete the journal?

Thanks for any input.

Eli
Martin Gregorie

Very Strange and Probably Obscure Problem

Post by Martin Gregorie »

On Thu, 2012-04-19 at 00:40 -0500, eliwap wrote:
Bummer. I probably do have to replace the hard drive
Well, dimesio said that was likely. I sling a drive as soon as smartd
says it's going sour.
Checked smartd again. The service was disabled even though I explicitly enabled it. And once I brought it online again, I immediately got
Did you mark it to start on booting as well as starting it? Those are
two separate actions for both of the daemon/service management systems. On
the old SysV init you need to use both chkconfig and service
respectively. With systemd, systemctl does both jobs (via its enable and
start subcommands).
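As a rough illustration of the two separate actions, the sketch below prints the pair of commands for each init system rather than running them, since they need root. The function name is hypothetical, and smartd is assumed to be the service name (as it is on Fedora).

```shell
#!/bin/sh
# Print the commands that enable a service at boot and start it now.
# service_cmds is a hypothetical helper; nothing is executed.

service_cmds() {
    init=$1                # "sysv" or "systemd"
    svc=${2:-smartd}       # assumed service name
    case $init in
        sysv)
            echo "chkconfig $svc on"               # start at boot
            echo "service $svc start"              # start now
            ;;
        systemd)
            echo "systemctl enable $svc.service"   # start at boot
            echo "systemctl start $svc.service"    # start now
            ;;
        *)
            echo "unknown init system: $init" >&2
            return 1
            ;;
    esac
}
```

Fedora 16 uses systemd, so service_cmds systemd prints the pair that applies here. Running only the start command would explain a service that works until the next reboot and then shows up as disabled.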

WARNING: Your hard drive is failing
Device: /dev/sdb [SAT], 2 Offline uncorrectable sectors

That was the same error I got before I ran the maintenance procedure.

Time to burn a clonezilla cd
If you made a backup before the last repair, use that. Otherwise running
fsck before trying to make another backup is a good idea.

Backup hint: Long ago I moved the contents of /usr/local and the
contents of directories associated with Java, PostgreSQL and Apache to
directories in /home and replaced the original directories with
symlinks. This means I only need to back up /home and keep copies of
files in /etc that I've manually altered. This makes backups fast and
easy (and even faster if you backup to a USB disk or two with rsync).
I'm a belt and braces man when it comes to backups: I have an automatic
overnight backup and do a manual one immediately before my weekly "yum
upgrade" run. A side benefit is that this makes distro version upgrades
a lot easier.

Full details here if you want to try it:
http://www.libelle-systems.com/free/lin ... rades.html

Martin
dimesio
Moderator
Posts: 13207
Joined: Tue Mar 25, 2008 10:30 pm

Post by dimesio »

eliwap wrote:Oh.... I was thinking. Could a problem like this arise from a corrupted journal.
If smartctl says your hard drive is failing, it's failing. Modern hard drives have a reserved area to remap bad sectors to, and they do it automatically, without telling the user. When you reach the point that tools like smartctl report uncorrectable sectors, that means that the drive has already used up all the reserved space. You have a lot more bad sectors on that drive than the two smartctl is reporting.
Martin Gregorie

Very Strange and Probably Obscure Problem

Post by Martin Gregorie »

On Thu, 2012-04-19 at 07:13 -0500, dimesio wrote:
eliwap wrote:
Oh.... I was thinking. Could a problem like this arise from a corrupted journal.
If smartctl says your hard drive is failing, it's failing. Modern hard
drives have a reserved area to remap bad sectors to, and they do it
automatically, without telling the user. When you reach the point
that tools like smartctl report uncorrectable sectors, that means that
the drive has already used up all the reserved space. You have a lot
more bad sectors on that drive than the two smartctl is reporting.
FWIW, you can get more information, including the size of the sector
reallocation table and the reallocation count, by running
"smartctl --xall /dev/sda" instead of "smartctl --all /dev/sda".
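The full smartctl report is long, so a small filter can help pick out the reallocation and bad-sector counters. This is a minimal sketch: the function name is made up, and it assumes the attribute table layout in which the attribute name is the second field and RAW_VALUE is the last.

```shell
#!/bin/sh
# Pull the sector-health counters out of a smartctl attribute table.
# Reads the table on stdin and prints NAME=RAW_VALUE for each match.

sector_counters() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable)$/ {
        print $2 "=" $NF
    }'
}

# Sample rows in the assumed layout; real input would come from smartctl.
sector_counters <<'EOF'
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  9 Power_On_Hours          -O--CK   089   089   000    -    8509
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   200   200   000    -    0
EOF
```

On a live system the input would be piped in, for example smartctl --xall /dev/sda | sector_counters, run as root. Non-zero raw values for these attributes are the ones worth watching over time.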


Martin
lmn40227
Newbie
Posts: 3
Joined: Mon Apr 16, 2012 8:00 pm

Post by lmn40227 »

If I am making a mistake, or there is a better way to do this, then I
would greatly appreciate any advice.
Martin Gregorie

Very Strange and Probably Obscure Problem

Post by Martin Gregorie »

On Fri, 2012-04-20 at 11:59 -0500, lmn40227 wrote:
If I am making a mistake, or there is a better way to do this, then I
would greatly appreciate any advice.
Depends which filing system you're using. ext3 probably is adversely
affected by bad blocks in the journal but ext4, which checksums journal
blocks, should detect them. Personally, I wouldn't try disabling
journalling. In any case, it only makes a difference if there were
files open and being written to at the time of the crash. If you've
successfully booted since then, the journal will have been used during
start-up so disabling it becomes somewhat moot. That said, the easiest
way of disabling it is to change the partition entries in /etc/fstab to
ext2.

At present it's a case of 'you pays your money and takes your choice'
between the two because each has minor gotchas not shared by the other.
My boxes currently use ext2 for /boot and a mix of ext3 and ext4 for the
other partitions (I keep /home in a separate partition for faster
upgrades and also have an encrypted partition). I'm not planning to
change this until Fedora 18, when I'll think about moving to btrfs.


Martin
eliwap
Level 2
Posts: 17
Joined: Thu Jun 25, 2009 3:44 pm

Post by eliwap »

Thanks, Martin, for all your great advice. I ran smartctl --xall /dev/sdb. It came back with a whole lot of info that I can't make heads or tails of. It said that there were some kind of errors.

I was wondering if I can post them here and you could take a look at the output?

Eli
Martin Gregorie

Very Strange and Probably Obscure Problem

Post by Martin Gregorie »

On Tue, 2012-04-24 at 12:30 -0500, eliwap wrote:
Thanks, Martin, for all your great advice. I ran smartctl --xall /dev/sdb. It came back with a whole lot of info that I can't make heads or tails of. It said that there were some kind of errors.

I was wondering if I can post them here and you could take a look at the output?
That's OK by me, and I hope the rest of the list is also happy.

IMO everybody should run smartd as a matter of course to get warning
when a disk is going sour. If you haven't already done this, set your
mail alias so the local MTA (usually sendmail) will redirect all mail
sent to root to your usual login - the one where you pick up your
e-mail. By doing this you'll see the smartd short status reports, which
should be part of the daily logwatch report. I also configure smartd to
produce a more detailed report, like the one you requested via smartctl,
on a weekly basis.

As the full report is quite long, put it in pastebin and post the URL
here. I make no guarantees about spotting the problem, though.


Martin
jjmckenzie
Moderator
Posts: 1153
Joined: Wed Apr 27, 2011 11:01 pm

Very Strange and Probably Obscure Problem

Post by jjmckenzie »

On Tue, Apr 24, 2012 at 10:30 AM, eliwap wrote:
Thanks, Martin, for all your great advice. I ran smartctl --xall /dev/sdb. It came back with a whole lot of info that I can't make heads or tails of. It
said that there were some kind of errors.
It really and seriously looks like it is time to replace that drive.
If it is within warranty time, get it replaced. Backup any essential
files to a different drive, CD or DVD.

BTW, I run a continuous backup system and the drive for it reported
errors. I ran a program to inspect and repair the drive and all is well
now. However, if the drive reports errors again, it is going to be
replaced.

James
eliwap
Level 2
Posts: 17
Joined: Thu Jun 25, 2009 3:44 pm

Post by eliwap »

James.... I agree with your assessment. But it's also a good opportunity to learn a little something new :)

Martin... Thanks again for your help. The output follows:

smartctl 5.42 2011-10-20 r3458 [x86_64-linux-3.3.2-6.fc16.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green
Device Model: WDC WD10EADS-00M2B0
Serial Number: WD-WCAV54491480
LU WWN Device Id: 5 0014ee 258ffb6bd
Firmware Version: 01.00A01
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Wed Apr 25 07:22:47 2012 IDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (20760) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 239) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3037) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   199   156   051    -    68440
  3 Spin_Up_Time            POS--K   122   112   021    -    6891
  4 Start_Stop_Count        -O--CK   100   100   000    -    118
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   089   089   000    -    8509
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    118
192 Power-Off_Retract_Count -O--CK   200   200   000    -    111
193 Load_Cycle_Count        -O--CK   001   001   000    -    932987
194 Temperature_Celsius     -O---K   111   101   000    -    36
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   200   200   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
GP/S Log at address 0x00 has 1 sectors [Log Directory]
SMART Log at address 0x01 has 1 sectors [Summary SMART error log]
SMART Log at address 0x02 has 5 sectors [Comprehensive SMART error log]
GP Log at address 0x03 has 6 sectors [Ext. Comprehensive SMART error log]
SMART Log at address 0x06 has 1 sectors [SMART self-test log]
GP Log at address 0x07 has 1 sectors [Extended self-test log]
SMART Log at address 0x09 has 1 sectors [Selective self-test log]
GP Log at address 0x10 has 1 sectors [NCQ Command Error log]
GP Log at address 0x11 has 1 sectors [SATA Phy Event Counters]
GP/S Log at address 0x80 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x81 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x82 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x83 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x84 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x85 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x86 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x87 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x88 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x89 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8f has 16 sectors [Host vendor specific log]
GP/S Log at address 0x90 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x91 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x92 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x93 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x94 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x95 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x96 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x97 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x98 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x99 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9f has 16 sectors [Host vendor specific log]
GP/S Log at address 0xa0 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa1 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa2 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa3 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa4 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa5 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa6 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa7 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa8 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xa9 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xaa has 1 sectors [Device vendor specific log]
GP/S Log at address 0xab has 1 sectors [Device vendor specific log]
GP/S Log at address 0xac has 1 sectors [Device vendor specific log]
GP/S Log at address 0xad has 1 sectors [Device vendor specific log]
GP/S Log at address 0xae has 1 sectors [Device vendor specific log]
GP/S Log at address 0xaf has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb0 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb1 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb2 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb3 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb4 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb5 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb6 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb7 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xc0 has 1 sectors [Device vendor specific log]
GP Log at address 0xc1 has 24 sectors [Device vendor specific log]
GP/S Log at address 0xe0 has 1 sectors [SCT Command/Status]
GP/S Log at address 0xe1 has 1 sectors [SCT Data Transfer]

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 3011 (device log contains only the most recent 24 errors)
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3011 [2] occurred at disk power-on lifetime: 8364 hours (348 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 3d b3 17 1a 40 00 Error: UNC at LBA = 0x3db3171a = 1035147034

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 3d 17 00 b3 17 40 08 09:49:48.924 READ FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 09:49:48.923 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 00 00 00 00 00 e0 08 09:49:48.923 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 09:49:48.922 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 09:49:48.922 SET FEATURES [Set transfer mode]

Error 3010 [1] occurred at disk power-on lifetime: 8364 hours (348 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 3d b3 17 1a 40 00 Error: UNC at LBA = 0x3db3171a = 1035147034

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 3d 17 00 b3 17 40 08 09:49:46.304 READ FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 09:49:46.304 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 00 00 00 00 00 e0 08 09:49:46.304 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 09:49:46.302 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 09:49:46.302 SET FEATURES [Set transfer mode]

Error 3009 [0] occurred at disk power-on lifetime: 8364 hours (348 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 3d b3 17 1a 40 00 Error: UNC at LBA = 0x3db3171a = 1035147034

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 3d 17 00 b3 17 40 08 09:49:43.685 READ FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 09:49:43.685 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 00 00 00 00 00 e0 08 09:49:43.685 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 09:49:43.683 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 09:49:43.683 SET FEATURES [Set transfer mode]

Error 3008 [23] occurred at disk power-on lifetime: 8364 hours (348 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 3d b3 17 1a 40 00 Error: UNC at LBA = 0x3db3171a = 1035147034

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 3d 17 00 b3 17 40 08 09:49:41.066 READ FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 09:49:41.066 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 00 00 00 00 00 e0 08 09:49:41.066 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 09:49:41.063 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 09:49:41.063 SET FEATURES [Set transfer mode]

Error 3007 [22] occurred at disk power-on lifetime: 8364 hours (348 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 3d b3 17 1a 40 00 Error: UNC at LBA = 0x3db3171a = 1035147034

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 3d 17 00 b3 17 40 08 09:49:38.446 READ FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 09:49:38.446 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 00 00 00 00 00 e0 08 09:49:38.446 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 09:49:38.444 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 09:49:38.444 SET FEATURES [Set transfer mode]

Error 3006 [21] occurred at disk power-on lifetime: 8364 hours (348 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 3d b3 17 1a 40 00 Error: UNC at LBA = 0x3db3171a = 1035147034

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 3d 17 00 b3 17 40 08 09:49:35.839 READ FPDMA QUEUED
60 00 08 00 00 00 3d 15 00 b3 7f 40 08 09:49:35.822 READ FPDMA QUEUED
60 01 00 00 08 00 09 08 00 41 a7 40 08 09:49:35.821 READ FPDMA QUEUED
60 01 00 00 00 00 09 07 00 41 a7 40 08 09:49:35.819 READ FPDMA QUEUED
60 01 00 00 08 00 09 06 00 41 a7 40 08 09:49:35.818 READ FPDMA QUEUED

Error 3005 [20] occurred at disk power-on lifetime: 8359 hours (348 days + 7 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 3d b3 17 1a 40 00 Error: UNC at LBA = 0x3db3171a = 1035147034

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 01 00 00 08 00 3d 17 00 b3 8f 40 08 04:31:12.286 READ FPDMA QUEUED
60 00 08 00 00 00 3d 17 00 b3 17 40 08 04:31:12.286 READ FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 04:31:12.286 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 00 00 00 00 00 e0 08 04:31:12.286 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 04:31:12.283 IDENTIFY DEVICE

Error 3004 [19] occurred at disk power-on lifetime: 8359 hours (348 days + 7 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 3d b3 17 1a 40 00 Error: UNC at LBA = 0x3db3171a = 1035147034

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 08 00 3d 17 00 b3 17 40 08 04:31:09.666 READ FPDMA QUEUED
60 01 00 00 00 00 3d 17 00 b3 8f 40 08 04:31:09.666 READ FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 04:31:09.666 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 00 00 00 00 00 e0 08 04:31:09.666 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 04:31:09.664 IDENTIFY DEVICE

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version: 2
SCT Version (vendor specific): 258 (0x0102)
SCT Support Level: 1
Device State: SMART Off-line Data Collection executing in background (4)
Current Temperature: 36 Celsius
Power Cycle Min/Max Temperature: 35/38 Celsius
Lifetime Min/Max Temperature: 35/46 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/60 Celsius
Min/Max Temperature Limit: -41/85 Celsius
Temperature History Size (Index): 478 (6)

Index Estimated Time Temperature Celsius
7 2012-04-24 23:25 36 *****************
... ..( 21 skipped). .. *****************
29 2012-04-24 23:47 36 *****************
30 2012-04-24 23:48 35 ****************
31 2012-04-24 23:49 36 *****************
... ..( 26 skipped). .. *****************
58 2012-04-25 00:16 36 *****************
59 2012-04-25 00:17 35 ****************
60 2012-04-25 00:18 36 *****************
... ..( 51 skipped). .. *****************
112 2012-04-25 01:10 36 *****************
113 2012-04-25 01:11 37 ******************
... ..( 3 skipped). .. ******************
117 2012-04-25 01:15 37 ******************
118 2012-04-25 01:16 36 *****************
119 2012-04-25 01:17 37 ******************
... ..( 8 skipped). .. ******************
128 2012-04-25 01:26 37 ******************
129 2012-04-25 01:27 36 *****************
... ..( 69 skipped). .. *****************
199 2012-04-25 02:37 36 *****************
200 2012-04-25 02:38 35 ****************
201 2012-04-25 02:39 36 *****************
... ..( 14 skipped). .. *****************
216 2012-04-25 02:54 36 *****************
217 2012-04-25 02:55 35 ****************
218 2012-04-25 02:56 36 *****************
... ..( 18 skipped). .. *****************
237 2012-04-25 03:15 36 *****************
238 2012-04-25 03:16 35 ****************
239 2012-04-25 03:17 36 *****************
... ..( 44 skipped). .. *****************
284 2012-04-25 04:02 36 *****************
285 2012-04-25 04:03 35 ****************
286 2012-04-25 04:04 36 *****************
... ..( 25 skipped). .. *****************
312 2012-04-25 04:30 36 *****************
313 2012-04-25 04:31 35 ****************
314 2012-04-25 04:32 36 *****************
... ..(141 skipped). .. *****************
456 2012-04-25 06:54 36 *****************
457 2012-04-25 06:55 35 ****************
458 2012-04-25 06:56 36 *****************
... ..( 17 skipped). .. *****************
476 2012-04-25 07:14 36 *****************
477 2012-04-25 07:15 35 ****************
0 2012-04-25 07:16 36 *****************
... ..( 5 skipped). .. *****************
6 2012-04-25 07:22 36 *****************

Warning: device does not support SCT Error Recovery Control command
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x000a 2 10 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x8000 4 519997 Vendor specific
Martin Gregorie

Very Strange and Probably Obscure Problem

Post by Martin Gregorie »

On Tue, 2012-04-24 at 23:26 -0500, eliwap wrote:
James.... I agree with your assessment. But it's also a good opportunity to learn a little something new :)

Martin... Thanks again for your help. The output follows:
What I see are a load of raw read errors plus a smaller number of spin
up errors in the statistics table. The diagnosed errors are mostly
failures to set a required transfer mode and a failure to queue a read
request which could be a queue overflow due to a failed and retrying
read request.
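
For anyone who wants to check a log like this themselves, here is a rough sketch (not part of smartctl, just an illustration) that tallies the "Error: ... at LBA" summary lines. The function name tally_errors and the sample text are made up for the example; the sample lines are copied from the log above. In the part of the log shown here, every logged error is a UNC (uncorrectable read) at the same LBA, which is what a single bad sector being retried would look like.

```python
import re
from collections import Counter

# Matches the summary smartctl prints after each logged error's registers,
# e.g. "Error: UNC at LBA = 0x3db3171a = 1035147034"
ERROR_RE = re.compile(r"Error: (\w+) at LBA = 0x([0-9a-fA-F]+) = (\d+)")


def tally_errors(log_text):
    """Count (error type, LBA) pairs found in smartctl error-log text.

    tally_errors is a hypothetical helper name, not a smartmontools API.
    """
    counts = Counter()
    for kind, lba_hex, lba_dec in ERROR_RE.findall(log_text):
        lba = int(lba_hex, 16)
        # smartctl prints the LBA in hex and decimal; they should agree.
        if lba != int(lba_dec):
            raise ValueError(f"LBA mismatch: 0x{lba_hex} vs {lba_dec}")
        counts[(kind, lba)] += 1
    return counts


# Two of the summary lines from the log above, used as sample input.
sample = """\
40 -- 51 00 08 00 00 3d b3 17 1a 40 00 Error: UNC at LBA = 0x3db3171a = 1035147034
40 -- 51 00 08 00 00 3d b3 17 1a 40 00 Error: UNC at LBA = 0x3db3171a = 1035147034
"""

for (kind, lba), n in tally_errors(sample).items():
    print(f"{kind} at LBA {lba}: {n} time(s)")
```

To run it on a whole report, save the smartctl output to a file and pass its contents to tally_errors. Many distinct LBAs would suggest wider surface damage, while one repeated LBA points at a single unreadable sector; either way the disk is not to be trusted.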

I'd replace that disk ASAP.

If it's less than three years old it's under WD guarantee. If you know
where you got it the dealer may handle the guarantee claim. Otherwise,
contact WD directly.

I've had a number of WD 3.5" drives fail, but always after several years'
use. However, last year I had a string of 2.5" Caviar Blue drives fail
after less than 10 hours operation. The first two were replaced through
the dealer and failed almost immediately. The third went back direct to
WD and seems to be working OK though I haven't used it much.

BTW, if you're using any sort of mirroring (RAID 1 or RAID 5 and their
equivalents) it's a good idea to make sure that the disks in the set are
*NOT* all from the same batch and, if you're sufficiently mistrustful,
not all from the same manufacturer either though, of course, they should
all be the same size.


Martin