Common Error Messages

From FaHWiki

Many of these common errors relate to Core Status Codes, which are described on the CoreStatus codes page.

EARLY_UNIT_END

Quite possibly the most common error found today. EARLY_UNIT_END is usually caused by one of two things: a bad WU or an unstable system.

A certain percentage of WUs will end in an EUE spontaneously. There is no way to predict which ones (other than running them), but the problem can be managed. The Pande Group generally keeps the percentage quite low and, in all cases, below 5% of the WUs. There is no precise way for you to tell whether an EUE was due to a "bad WU" or a hardware error. Multiple EUEs generally indicate a hardware problem; an occasional one can be ignored.
After a WU that ended in an EUE is returned to the server, it will normally be assigned to another computer. The Pande Group and the site moderators can then tell whether others have had a similar EUE on that WU or whether it was completed successfully.
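
Since repeated EUEs point toward hardware while a single one can be ignored, it can help to count how often each status shows up in a log. The following is a minimal sketch, assuming the status names from this page appear literally in fahlog.txt lines; the function names and the threshold of 3 are hypothetical choices, not part of the client.

```python
from collections import Counter

# Status names taken from this page. Assumption: they appear verbatim
# somewhere in the log lines being scanned.
STATUS_NAMES = (
    "EARLY_UNIT_END",
    "FILE_IO_ERROR",
    "CLIENT_DIED",
    "BAD_FRAME_CHECKSUM",
    "SPECIAL_EXIT",
)


def count_statuses(log_lines):
    """Count how many log lines mention each known status name."""
    counts = Counter()
    for line in log_lines:
        for name in STATUS_NAMES:
            if name in line:
                counts[name] += 1
    return counts


def looks_like_hardware_problem(counts, threshold=3):
    """Flag repeated EUEs. The threshold is an arbitrary, hypothetical value."""
    return counts["EARLY_UNIT_END"] >= threshold
```

To use it, pass the lines of a log file, for example `count_statuses(open("fahlog.txt"))`. A high count is only a hint to check the hardware, not proof of a fault.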

For a more in-depth explanation, see Early Unit End and EUE Types.

Note: See the description about "-forceasm" (3.x) or "-forceSSE" (4.x) causing SPECIAL_EXIT on certain AMD-based systems. If you are running an AMD Athlon XP with the Thoroughbred or Barton core, you should remove the "-forceasm" or "-forceSSE" switch, which will most likely fix the problem.


Couldn't send HTTP request to server (wininet)

The most common cause of this message is the FAH client's "Use IE Settings" option being set to yes. Before Internet Explorer v7 was released, a yes setting was not an issue, but Microsoft has since released security patches to its operating systems and browsers that now cause a connection error.

  • In the console client, run with the -configonly switch, and change the setting to No. For more detailed instructions, see the entry on how to Reconfigure the Console client.
  • In the GUI client, right click on the FAH tool tray icon and select configure. Select the Connection Tab. Uncheck the box for Use IE Settings. Also uncheck the box for Proxy, unless you are specifically using a proxy connection. Click OK. Right click the icon again and select Quit. Wait 2 minutes for the client to shut down, and then start it again using the shortcut in the Startup folder under Start/Programs. For more detailed instructions, see the entry on how to Reconfigure the GUI client.

This is an example of the error message from a fahlog.txt file:

+ Attempting to send results
Couldn't send HTTP request to server (wininet)
+ Could not connect to Work Server (results)
(171.xx.xx.xx:xxxx)
- Error: Could not transmit unit 02 (completed April 6) to work server.

Note: This setting was removed in the v6 client to avoid the problem going forward.
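
To check whether a log contains this failure, one can look for the wininet message followed by a "Could not transmit unit" line, as in the excerpt above. This is a minimal sketch that assumes the log lines match that excerpt; the function name is hypothetical.

```python
import re

# Message text copied from the log excerpt on this page.
WININET_MSG = "Couldn't send HTTP request to server (wininet)"
UNIT_RE = re.compile(r"Could not transmit unit (\d+)")


def failed_wininet_units(log_lines):
    """Return unit numbers whose upload failed right after a wininet error."""
    units = []
    saw_wininet = False
    for line in log_lines:
        if WININET_MSG in line:
            saw_wininet = True
            continue
        match = UNIT_RE.search(line)
        if match and saw_wininet:
            # Pair the transmit failure with the preceding wininet message.
            units.append(match.group(1))
            saw_wininet = False
    return units
```

If the returned list is not empty, the "Use IE Settings" option described above is the first thing to check.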


Users who normally leave Internet Explorer in offline mode (perhaps due to security concerns) may also experience problems when attempting to upload WUs. To rectify this, set Internet Explorer to online mode before a WU is due to be sent. This can be done from the File menu, either in IE itself or in certain other WinInet applications such as Windows Media Player. Setting the "Ask before fetching/sending work" option to yes may make the process a little easier for users who prefer to keep full control of Internet Explorer's access to the Internet, but it may also slow down WU turnaround to a degree.

FILE_IO_ERROR

An error that occurs when disk operations go bad. This is a fairly general error, having many sub-types. It has plummeted in frequency since the release of Gromacs Core 1.46. Now, this error usually happens when a hardware error occurs: something like "Write 0010, read back 0011". If you experience this error, make sure your hard drives are OK: run ScanDisk, CHKDSK, or fsck, make sure the IDE bus is in spec, make sure you're using good IDE cables, and make sure the drive isn't dying.
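
The "write, then read back" failure described above can be illustrated with a small check. This is a rough sketch, not a substitute for ScanDisk, CHKDSK, or fsck: the operating system may serve the read from its cache rather than from the disk, so a pass does not prove the drive is healthy. The function name and default size are hypothetical.

```python
import os
import tempfile


def write_read_back_ok(directory, size=1 << 20):
    """Write random bytes to a temp file in `directory`, read them back, compare."""
    data = os.urandom(size)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        with open(path, "rb") as f:
            return f.read() == data
    finally:
        os.remove(path)  # always clean up the temporary file
```

A result of False on the drive that holds the client's work directory would be a strong reason to run the disk tools listed above.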

FILE_IO_ERROR has also been reported when two Console clients are started while working on the same unit. This can happen if you accidentally start one client twice on a dual-processor machine, instead of starting two separate clients once each.
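
One way to guard against starting a second client in the same directory is a lock file created by a wrapper script before the client is launched. The sketch below is an illustration only; it is not something the client does itself, and the file name `client.lock` is a hypothetical choice.

```python
import os

LOCK_NAME = "client.lock"  # hypothetical name, not used by the FAH client


def acquire_lock(work_dir):
    """Create the lock file atomically; return its path, or None if it exists."""
    path = os.path.join(work_dir, LOCK_NAME)
    try:
        # O_EXCL makes the call fail if the file is already there.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record which process holds the lock
    return path


def release_lock(path):
    """Remove the lock file once the client has exited."""
    os.remove(path)
```

A wrapper would call `acquire_lock` first and refuse to start the client when it returns None. After a crash or power loss the stale lock file has to be removed by hand.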

FILE_IO_ERROR has also been reported with certain anti-virus software. See http://foldingforum.org/viewtopic.php?f=8&t=1688#p14096

CLIENT_DIED

This happens when, simply enough, the client dies. The core is still running, and can't find the client, so it shuts down. This is usually related to overclocking and/or overly aggressive memory timings. Back down on these and this error should vanish.


UNKNOWN ERROR

A now rare Gromacs error that usually occurs if there's a corrupt WU being processed. It is no longer common and any instances should probably be reported (post a log, etc.). You may also want to check your hardware if you've had past errors.

Client-Core Communications Error

Descriptions and explanations of most Client-Core Communication Error messages can be found here: CoreStatus codes


BAD_FRAME_CHECKSUM

You'll see a block in your log that looks something like this:

[hh:mm:ss] Header on frame 220 differs from expected header
[hh:mm:ss] Got: A028B-5C-3E84B02E-EA1B7D4: 0220
[hh:mm:ss] Expected: A028B-5C-3E84B02E-EA1B7D4: 0219

Note that the two hexadecimal strings are the same; only the frame numbers differ. This strange error only occurs with Tinker units. The only known cause is two or more clients being started at once and working in the same directory, but there may be other causes. Bizarrely, this error often occurs on an early frame but is not detected until the unit's end.

BAD_FRAME_CHECKSUM, similar to one type of Gromacs FILE_IO_ERROR, can also mean that a hardware error occurred where there was a slight discrepancy between what was read and what was expected: something like writing 101010 and reading back 110110. Again, this is commonly not detected until the unit finishes.
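
The "Got" and "Expected" lines from the log block above can be compared mechanically. This is a minimal sketch that assumes the lines follow the format of that excerpt; the function name and the returned dictionary keys are hypothetical.

```python
import re

# Matches e.g. "Got: A028B-5C-3E84B02E-EA1B7D4: 0220"
HEADER_RE = re.compile(r"(Got|Expected):\s*([0-9A-Fa-f-]+):\s*(\d+)")


def compare_frame_headers(log_lines):
    """Extract the Got/Expected headers and report how they differ."""
    found = {}
    for line in log_lines:
        match = HEADER_RE.search(line)
        if match:
            found[match.group(1)] = (match.group(2), int(match.group(3)))
    if "Got" not in found or "Expected" not in found:
        return None  # the block is incomplete or absent
    got_id, got_frame = found["Got"]
    exp_id, exp_frame = found["Expected"]
    return {
        "same_id": got_id == exp_id,
        "got_frame": got_frame,
        "expected_frame": exp_frame,
    }
```

For the excerpt above, the identifiers match and only the frame numbers differ (220 against 219), which is the pattern this section describes.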


Server reports digital signature does not match

Some of the newer servers don't seem to like the older versions of the client. Upgrade to the latest client. In addition, a corrupted queue.dat file can cause this error to be reported. Running qfix may help resolve this issue. If you are running the latest client and qfix does not rectify the issue, report the error on the Folding-Community forums and delete the WU.


SPECIAL_EXIT

This server error means that something unknown happened inside the Gromacs core. The only known cause is applying "-forceasm" (3.x) or "-forceSSE" (4.x) to an AMD system that is not 100% stable with SSE. CPUs that had problems include processors with the Thoroughbred B, Barton, and Opteron cores. In this case it should be dealt with as an EARLY_UNIT_END error (see above). Removing "-forceasm" or "-forceSSE" will almost certainly fix the problem. SSE-related errors are now fairly rare compared to a few years ago.

If you are not forcing use of SSE and this error occurs, a log should be posted in the Folding-Community forum, as this is a serious problem.

Previous termination of core was improper

This is more of a status message than an error message, but it is often viewed as a problem, so it is included here. The message usually appears in the fahlog like this, with time stamps preceding each line:

Preparing to commence simulation 
- Ensuring status. Please wait. 
- Looking at optimizations... 
- Working with standard loops on this execution. 
- Previous termination of core was improper. 

The most common symptom of this message, other than this message in the log, is the client folding much slower than normal. Without SSE optimizations, each percentage complete takes 2-3 times longer.

The most common cause of this message is a client that was not shut down gracefully (quit or ctrl+c), which often happens after a computer reset, hard boot, or power outage. Restarting the client should correct the problem. Adding the -forceasm client switch is another option to prevent this from happening again.

Server does not have record of this unit

This is also more of a status message than an error message, but it is often viewed as a problem, so it is included here as well. The message usually appears in the fahlog like this, with time stamps preceding each line:

+ Attempting to send results 
- Server does not have record of this unit. Will try again later. 
  Could not transmit unit XX to Collection server; keeping in queue.

The more common cause of this message is a Work Server going down or offline unexpectedly, without a chance to update the list of outstanding work units on the Collection Server. When the client then can't upload to the Work Server (it's offline), it will attempt to connect to the Collection Server. The list of outstanding work units may be incomplete, so if there is no record of your WU, the CS won't accept the upload as a security precaution. The client will automatically retry the upload every 6 hours, so this message may appear in the log many times. When the Work Server comes back online, it will update the list on the Collection Server. The completed WU will then upload to either the Work Server or the Collection Server, whichever the client can connect to first.

A lesser cause of this message occurs when a F@h client uploads a completed work unit and the Work or Collection Server accepts and receives it, but the acknowledgement sent back is not received by the client due to some problem with the Internet connection. The client will think the upload was not completed and will attempt to upload the WU again. But because the project server already received the WU, it has taken that WU off the list of outstanding WUs and will not accept it again. After you have verified receiving credit for that specific WU, it can be deleted from the client queue. See the -delete xx client switch.
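
The lost-acknowledgement case can be pictured with a toy model of the server's list of outstanding WUs. This is only an illustration of the behaviour described above, not actual server code; the class and method names are hypothetical.

```python
class ToyServer:
    """Toy model: a server that tracks which WUs it is still waiting for."""

    REJECT = "Server does not have record of this unit"

    def __init__(self, outstanding):
        self.outstanding = set(outstanding)

    def upload(self, wu_id):
        if wu_id not in self.outstanding:
            # Unknown or already received: refuse the upload.
            return self.REJECT
        # First successful upload removes the WU from the list.
        self.outstanding.remove(wu_id)
        return "accepted"
```

If the client never sees the "accepted" reply and uploads the same WU again, the second call returns the rejection message, because the first upload already removed the WU from the list.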

Warning: long 1-4 interactions

This warning message appears in the fahlog.txt file. When this message appears by itself, it is more of a status message than an error. However, this warning message is often followed by another error message. That second error message is more indicative of the problem. Please search for that second error message for more information.
