Common Error Messages

From FaHWiki

Many of these common errors relate to Core Status Codes, which can be found here.

EARLY_UNIT_END

Quite possibly the most common error found today. EARLY_UNIT_END is usually caused by one of two things: a bad WU or an unstable system.

A certain percentage of WUs will reach an EUE spontaneously. There is no way to predict when this will happen (except to run them) but they can be managed. The Pande Group generally keeps the percentage quite low, and in all cases, below 5% of the WUs. There is no precise way for you to tell if the EUE was due to a "bad WU" or a hardware error. Multiple EUEs generally indicate a hardware problem. An occasional one should be ignored.
After an EUE is returned to the server, it will normally be assigned to another computer. The Pande Group and the site moderators can tell if others have had a similar EUE on that WU or if it was completed successfully.
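As a rough illustration of the "occasional versus multiple" rule of thumb above, the sketch below counts how many lines of a log mention EARLY_UNIT_END. The function names, the threshold of 3, and the sample lines are assumptions made for this example; they are not part of the client and are not Pande Group figures.

```python
def count_early_unit_ends(log_lines):
    """Count log lines that mention EARLY_UNIT_END."""
    return sum(1 for line in log_lines if "EARLY_UNIT_END" in line)


def assess(log_lines, threshold=3):
    """Turn the count into a rough hint.

    `threshold` is an arbitrary illustrative cut-off (an assumption),
    not an official figure.
    """
    count = count_early_unit_ends(log_lines)
    if count == 0:
        return "no EUEs logged"
    if count < threshold:
        return "occasional EUE: probably safe to ignore"
    return "repeated EUEs: check hardware stability"


# Placeholder lines, not real client output.
sample = ["placeholder line", "placeholder EARLY_UNIT_END line"]
print(assess(sample))
```

To check a real log, pass the lines of your own fahlog.txt instead of the placeholder list, for example `assess(open("fahlog.txt", errors="replace"))`.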

For a more in-depth explanation, see Early Unit End and EUE Types.

Note: See the description about "-forceasm" (3.x) or "-forceSSE" (4.x) causing SPECIAL_EXIT on certain AMD-based systems. If you are running an AMD Athlon XP with the Thoroughbred or Barton cores, you should remove the "-forceasm" or "-forceSSE" switch, which will most likely fix the problem.


Couldn't send HTTP request to server (wininet)

The most common cause of this message is the fah client configuration option "Use IE Settings" being set to yes. Before Internet Explorer v7 was released, a yes setting was not an issue, but Microsoft has since released security patches to its operating systems and browsers that now cause a connection error.

  • In the console client, run with the -configonly switch, and change the setting to No. For more detailed instructions, see the entry on how to Reconfigure the Console client.
  • In the GUI client, right click on the FAH tool tray icon and select configure. Select the Connection Tab. Uncheck the box for Use IE Settings. Also uncheck the box for Proxy, unless you are specifically using a proxy connection. Click OK. Right click the icon again and select Quit. Wait 2 minutes for the client to shut down, and then start it again using the shortcut in the Startup folder under Start/Programs. For more detailed instructions, see the entry on how to Reconfigure the GUI client.

This is an example of the error message from a fahlog.txt file:

+ Attempting to send results
Couldn't send HTTP request to server (wininet)
+ Could not connect to Work Server (results)
(171.xx.xx.xx:xxxx)
- Error: Could not transmit unit 02 (completed April 6) to work server.

Note: This setting was removed in the v6 client to avoid the problem going forward.


Users who normally leave Internet Explorer in offline mode (perhaps due to security concerns) may also experience problems when attempting to upload WUs. To rectify this issue, set Internet Explorer to online mode before a WU is due to be sent. This can be done from the File menu, either in IE itself or in certain other WinInet applications such as Windows Media Player. Setting the Ask before fetching/sending work option to yes may make the process a little easier for users who'd prefer to keep full control of Internet Explorer's access to the Internet, but it might also slow down WU turnaround to a degree.

FILE_IO_ERROR

An error that occurs when disk operations go bad. This is a fairly general error, having many sub-types. It has plummeted in frequency since the release of Gromacs Core 1.46. Now, this error usually happens when a hardware error occurs: something like "Write 0010, read back 0011". If you experience this error, make sure your hard drives are OK: run ScanDisk, CHKDSK, or fsck, make sure the IDE bus is in spec, make sure you're using good IDE cables, and make sure the drive isn't dying.

FILE_IO_ERROR has also been reported to occur if two Console clients are started while working on the same unit. This can happen if you accidentally start one client twice on a dual-processor machine, instead of starting each of two clients once.

FILE_IO_ERROR has also been reported with certain anti-virus software. See http://foldingforum.org/viewtopic.php?f=8&t=1688#p14096
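The "Write 0010, read back 0011" kind of mismatch described above can be illustrated with a simple write-then-read-back test. This is only a sketch: the function name and the 1 MB default size are assumptions for the example, and because the operating system may serve the read from its cache rather than from the disk, a passing result does not replace ScanDisk, CHKDSK, or fsck.

```python
import os
import tempfile


def readback_check(directory, size=1 << 20):
    """Write random bytes to a temp file in `directory`, read them back,
    and report whether the two copies are identical."""
    payload = os.urandom(size)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # ask the OS to push the data to the device
        with open(path, "rb") as fh:
            return fh.read() == payload
    finally:
        os.remove(path)  # always clean up the scratch file


if __name__ == "__main__":
    # The system temp directory is used here only as an example location.
    print(readback_check(tempfile.gettempdir()))
```

Pointing `directory` at the drive that holds the client's work folder makes the check a little more relevant, but a failure here is only a hint that the hardware deserves a closer look.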

CLIENT_DIED

This happens when, simply enough, the client dies. The core is still running, and can't find the client, so it shuts down. This is usually related to overclocking and/or overly aggressive memory timings. Back down on these and this error should vanish.


UNKNOWN ERROR

A now rare Gromacs error that usually occurs if there's a corrupt WU being processed. It is no longer common and any instances should probably be reported (post a log, etc.). You may also want to check your hardware if you've had past errors.

Client-Core Communications Error

Descriptions and explanations of most Client-Core Communication Error messages can be found here: CoreStatus codes


BAD_FRAME_CHECKSUM

You'll see a block in your log that looks something like this:

[hh:mm:ss] Header on frame 220 differs from expected header
[hh:mm:ss] Got: A028B-5C-3E84B02E-EA1B7D4: 0220
[hh:mm:ss] Expected: A028B-5C-3E84B02E-EA1B7D4: 0219

Note that the hexadecimal headers on the two lines are the same; only the frame numbers differ. This strange error only occurs with Tinker units. The only known cause is when two or more clients are started at once and are working in the same directory, but there may be other causes. This error often, bizarrely, occurs on an early frame but is not detected until the unit's end.

BAD_FRAME_CHECKSUM, similar to one type of Gromacs FILE_IO_ERROR, can also mean that a hardware error occurred where there was a slight discrepancy between what was read and what was expected: something like writing 101010 and reading back 110110. Again, this is commonly not detected until the unit finishes.
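To make the "same header, different frame number" observation concrete, the sketch below pulls the Got and Expected values out of a log block like the one shown above and compares them. The regular expression is an assumption based only on that one sample; other client versions may format these lines differently.

```python
import re

# Matches e.g. "Got: A028B-5C-3E84B02E-EA1B7D4: 0220" (format assumed from the sample).
PATTERN = re.compile(r"(Got|Expected):\s*([0-9A-Fa-f-]+):\s*(\d+)")


def parse_frame_mismatch(lines):
    """Return header/frame comparison for a Got/Expected pair, or None."""
    found = {}
    for line in lines:
        match = PATTERN.search(line)
        if match:
            kind, header, frame = match.groups()
            found[kind] = (header, int(frame))
    if "Got" not in found or "Expected" not in found:
        return None
    got_header, got_frame = found["Got"]
    exp_header, exp_frame = found["Expected"]
    return {
        "same_header": got_header == exp_header,
        "frame_offset": got_frame - exp_frame,
    }


sample = [
    "[hh:mm:ss] Header on frame 220 differs from expected header",
    "[hh:mm:ss] Got: A028B-5C-3E84B02E-EA1B7D4: 0220",
    "[hh:mm:ss] Expected: A028B-5C-3E84B02E-EA1B7D4: 0219",
]
print(parse_frame_mismatch(sample))
```

For the sample block the headers match and the frame numbers differ by one, which fits the description above.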


SPECIAL_EXIT

This server error means that something unknown happened inside the Gromacs core. The only known cause is when "-forceasm" (3.x) or "-forceSSE" (4.x) is applied to an AMD system that is not 100% stable with SSE. CPUs that had problems include processors with the Thoroughbred B, Barton, and Opteron cores. In this case it should be dealt with as an EARLY_UNIT_END error (see above). Removing "-forceasm" or "-forceSSE" will almost certainly fix the problem. SSE-related errors are now fairly rare compared to a few years ago.

If you are not forcing use of SSE and this error occurs, a log should be posted in the Folding-Community forum, as this is a serious problem.

Previous termination of core was improper

This is more of a status message than an error message, but it is often viewed as a problem, so it is included here. The message usually appears in the fahlog like this, with time stamps preceding each line:

Preparing to commence simulation 
- Ensuring status. Please wait. 
- Looking at optimizations... 
- Working with standard loops on this execution. 
- Previous termination of core was improper. 

The most common symptom of this message, other than this message in the log, is the client folding much slower than normal. Without SSE optimizations, each percentage complete takes 2-3 times longer.

The most common cause of this message is a client that was not shut down gracefully (quit or ctrl+c), which often happens after a computer reset, hard boot, or power outage. Restarting the client should correct the problem. Adding the -forceasm client switch is another option to prevent this from happening again.
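A quick way to tell whether a slow client is in this state is to look for the two tell-tale lines from the log excerpt above. The sketch below does only that; the function name is made up for this example, and the match strings are taken from the excerpt, so they may not match every client version.

```python
def slowed_by_improper_shutdown(log_lines):
    """Return True if the log shows both the improper-termination notice
    and the fallback to standard (non-optimized) loops."""
    text = "\n".join(log_lines)
    return (
        "Previous termination of core was improper" in text
        and "standard loops" in text
    )


sample = [
    "Preparing to commence simulation",
    "- Ensuring status. Please wait.",
    "- Looking at optimizations...",
    "- Working with standard loops on this execution.",
    "- Previous termination of core was improper.",
]
print(slowed_by_improper_shutdown(sample))  # True for the excerpt above
```

If the check returns True for your own log, restarting the client gracefully is the first thing to try.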

Note: This status message also appears in the SMP client log, where it is purely cosmetic. The SMP client has -forceasm hard-coded on, so it does not suffer the slowdown that the CPU client shows when running without optimizations, and SMP client performance is not affected.


Server reports digital signature does not match

Some of the newer servers don't seem to like the older versions of the client. Upgrade to the latest client. In addition to this, a corrupted queue.dat file can cause this error to be reported. Running qfix may help resolve this issue. If you are running the latest client, and qfix does not rectify the issue, report the error on the folding-community forums and delete the WU.


Server does not have record of this unit

This is also more of a status message than an error message, but it is often viewed as a problem, so it will be added here as well. The message usually appears in the fahlog like this, with time stamps preceding each line:

+ Attempting to send results 
- Server does not have record of this unit. Will try again later. 
  Could not transmit unit XX to Collection server; keeping in queue.

The more common cause of this message is when a Work Server goes down or off line unexpectedly, and does not have a chance to update the list of outstanding work units on the Collection Server. Then when the client can't upload to the Work Server (it's off line), it will attempt to connect to the Collection Server. The list of outstanding work units may be incomplete so if there is no record of your WU, the CS won't accept the upload as a security precaution. The completed WU will automatically try to upload every 6 hours, so this message may appear in the log many times. When the Work Server comes back online, it will update the list on the Collection Server. The completed WU will upload to either the Work Server or Collection Server, whichever the client can connect to first.

A less common cause of this message is a lost acknowledgment. The F@h client uploads a completed work unit and the Work or Collection Server accepts and receives it, but the acknowledgment sent back to the client never arrives because of some problem with the internet connection. The client will think the upload was not completed, and will attempt to upload the WU again. But because the project server already received the WU, the server has taken that WU off the list of outstanding WUs and will not accept it again. After you have verified receiving credit for that specific WU, it can be deleted from the client queue. See the -delete xx client switch.

A third possibility is that the WU has expired. Once a WU expires, the result will no longer be accepted for credit, and in fact the server discards its record that you have the WU, so it will not be accepted at all. The client is supposed to delete the WU when it expires, but that feature doesn't always work correctly, and you may have manually disabled it in the client configuration (Local clocks).


Server reports problem with unit

The work unit was uploaded to the Work Server and failed the data integrity check. There can be several causes. The most common cause is too much overclocking, which corrupts the WU data. Less common problems include a faulty network connection, network cable, etc., or even a bad work unit. The only way to know if the WU is bad is to post the WU data on the Folding Forum and ask a Mod to check whether another donor was able to complete the WU. In addition, with the increased security measures, if the WU was downloaded on one computer and then completed and uploaded on another, and the IDs do not match, the work unit might be discarded.

The message usually appears in the fahlog like this, with time stamps preceding each line:

+ Attempting to send results [August 13 14:15:16 UTC]
- Reading file work/wuresults_01.dat from core
  (Read 98765432 bytes from disk)
Connecting to http://123.123.123.123:8080/
Posted data.
Initial: 0000; - Uploaded at ~69 kB/s
- Averaged speed for that direction ~69 kB/s
- Server reports problem with unit.


Warning: long 1-4 interactions

This warning message appears in the fahlog.txt file. When this message appears by itself, it is more of a status message than an error. However, this warning message is often followed by another error message. That second error message is more indicative of the problem. Please search for that second error message for more information.


Read packet limit of 540015616... Set to 524286976

This status message is commonly mistaken for an error message. The client is actually documenting that it has been configured for Big Work Units and that the upper limit on WU size has been set at just under 500 MB (524286976 bytes). If there is an actual error, additional lines will be written after this message in the fahlog.txt.

[10:07:51] - Read packet limit of 540015616... Set to 524286976.
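The two byte counts in this message are easier to interpret once converted to megabytes. The short calculation below is just an arithmetic check on the numbers in the log line, taking 1 MB as 1024 × 1024 bytes; it is not something the client itself prints, and the helper name is made up for this example.

```python
MIB = 1024 * 1024  # bytes per megabyte (binary convention)


def shortfall_from_mib(n_bytes, mib):
    """How many bytes `n_bytes` falls short of `mib` whole megabytes."""
    return mib * MIB - n_bytes


requested = 540015616  # "Read packet limit of ..."
applied = 524286976    # "Set to ..."

print(shortfall_from_mib(applied, 500))    # 1024
print(shortfall_from_mib(requested, 515))  # 1024
print(round(applied / MIB, 3))             # 499.999
```

So the applied limit is 1024 bytes short of 500 MB, and the value originally read is 1024 bytes short of 515 MB.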


'mpiexec' is not recognized

This status message is displayed in the fahlog by the v6.30 SMP client. The requirement to run third-party MPICH software (mpiexec) for multi-threading was removed in v6.30 with the addition of the new fahcore_a3.exe. However, some of the hooks in the client to MPICH software were not removed.

When you upgrade from v6.29 to v6.30, this message is not displayed, unless MPICH was uninstalled. New installations of v6.30 will display this status message. It is purely cosmetic, and does not affect the performance of the client or FahCore.

'mpiexec' is not recognized as an internal or external command,
operable program or batch file.


Error: Bad packet type from server

This error message is displayed in the fahlog by the v6.xx client. This may happen when the Assignment Server (AS) thinks the Work Server (WS) has work to assign, but the WS has run out of work units by the time the client sends the download request.

Initial: 0000; - Error: Bad packet type from server, expected work assignment
- Attempt #1  to get work failed, and no other work to do.

One solution is patience. The work server will have more work units available soon. Another, more immediate solution is to change the client options (- switches) to change the WU types requested by the client. For example, remove the beta or advanced methods option to request assignments of all types of work units. If more work is not available within a few hours, please check the Announcement section of the folding forum for any maintenance or outage posts. Please also check the FAH NEWS page. If nothing is posted, please start a new topic to report the problem in the Issues with a specific server section (assuming one does not already exist there).


Common V7 Client Errors

Just starting to collect these. Will add answers and errors as available.

Exception: Failed to send results to work server: 10002: Received short response, expected 512 bytes, got 0
ERROR:exception: Bad platformId size.
Exception: Failed to send results to work server: Failed to connect to xxx.xxx.xxx.xxx
Failed to get assignment from 'assign4.stanford.edu:80': Empty work server assignment
Exception: Could not get an assignment
Exception: Option 'gpu-index' has no default and is not set.
exception: Error invoking kernel execFFT: clEnqueueNDRangeKernel (-55)
Exception: Failed to access core package.
ERROR:exception: Max Retries Reached
Exception: Failed to wait on process 8361:No child processes
Exception: Failed to send results to work server: Failed to connect to 171.64.65.98:80: Connection refused
Failed to send results, will try again later
Exception: Failed to send results to work server: Transfer failed
Exception: 10001: Server responded: HTTP_FORBIDDEN
ERROR:exception: clSetKernelArg



