FAH & SMP

General
The current Folding@Home client version 5 (up to and including v5.04) does not have the ability to manage multiple CPUs from a single client. Presently the client is aware of only 1 physical CPU. SMP functionality is currently being beta tested in v5.91 of the F@H client. See also: SMP client, Folding@Home high performance client FAQ and FAH & Clusters

It is, however, possible to use up to 8 CPUs in a single system with the version 5 client. (The number can be increased further by using tricks to isolate the UserID, but out of the box only 8 can be used.) In other words, your dual-CPU or 8-way Opteron system can be used to its full potential. To do so, run multiple instances of the client, each configured with its own unique Machine ID: one client for each physical CPU installed, or one client for each core in the case of one or more multi-core CPUs.

Note: As of November 2006, this is changing. A beta version of an SMP client is being made available. It supports 4-way processing (typically a system containing two CPUs, each with 2 independent cores) on Linux and Mac OS X. Eight-way hardware can be supported with two clients. A beta SMP client for Windows was added on 17-Mar-2007. However, it is an open question whether running a single SMP installation has any advantages over running multiple independent installations.

The FAH SMP client is set up to use multiple CPUs as if they were multiple cluster nodes attached to localhost. Whether the current beta SMP client, or any future release of it, will let you set up a real cluster with computing nodes attached to an external network is a different topic.

Hyper-Threading
Although advertised as a virtual second CPU, a Hyper-Threading (HT) enabled CPU is not equivalent to a dual-CPU setup with regard to FAH.

In simple terms, Hyper-Threading allows unused instruction units of the CPU to be utilized in parallel. Whilst this can be very effective for certain application classes, for FAH it is not especially helpful. SMP-aware OSes see a Hyper-Threaded CPU as two CPUs, when in actual fact there is one complete CPU plus its "spare" instruction units.

Molecular Dynamics (MD) calculations use, almost exclusively, one type of instruction unit, which differs depending on the core: FPU/SSE for Gromacs and QMD, and the FPU for Tinker and Amber. A few parts of the calculations are performed using the other instruction units, but they are minor compared with the main SSE/FPU code.

If one instance of FAH is running, large segments of the CPU, such as the ALU, are not being used. This lets the OS and most other programs run alongside FAH with little performance impact, since most "everyday" programs don't require the segments of the CPU that FAH is using.

However, if you try to run two instances of FAH on the same Hyper-Threaded CPU, they will both compete for the same instruction units, so each instance has to share its access to them. Overall, a small performance gain is achieved thanks to the few calculations that can run on the "spare" instruction units. The end result, however, is that the two WUs you are running each take just under twice as long to complete as a single WU would, because of the aforementioned instruction unit sharing. The recommended procedure is to run a single instance of the client per physical CPU, because this finishes a Work Unit (WU) in the shortest time. This is more valuable to the science of FAH, since the data from processed WUs is used to generate new ones.
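To put rough numbers on this trade-off, the sketch below takes the "~60-65%" per-process speed quoted in the Related Info section of this article (an estimate from a forum post, not a benchmark) and computes, for two processes sharing one HT CPU, how much longer each WU takes and how the combined throughput compares with a single process. The helper name ht_tradeoff is made up for this illustration.

```shell
# ht_tradeoff SPEED
# SPEED is the speed of each of two HT processes relative to one process
# running alone (for example 0.60).
# Prints two factors: <time per WU> <combined throughput>, both relative
# to a single process on the same CPU.
ht_tradeoff() {
    awk -v s="$1" 'BEGIN { printf "%.2f %.2f\n", 1 / s, 2 * s }'
}

ht_tradeoff 0.60
ht_tradeoff 0.65
```

Under these assumed figures each WU takes roughly 1.5 to 1.7 times as long to return, while the combined output rises by only 20 to 30 percent.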

If you have an HT-capable CPU, the best way of running FAH is to install one instance of FAH and leave HT enabled. This way you gain a slight performance advantage over disabling HT, since FAH will be affected to a lesser degree by other programs running.

A significant disadvantage of enabling Hyper-Threading is that it draws almost as much power as a complete second CPU. From CustomPC: ''Hyper-Threading also comes with a power consumption cost, allegedly drawing as many watts as a full second core when at full tilt. We've certainly seen the heat production increase considerably when pummeling a Hyper-Threaded processor with two Folding@home work units, and this is probably why the technology is currently relegated to performance enthusiast parts, such as the Pentium Extreme Edition 840, now that dual-core processors are becoming the norm.'' Source article: Processors Explained, Page 4

Note: Windows Task Manager will show the CPU usage whilst running one instance of FAH on an HT-enabled CPU to be only 50%. This is wrong. The single instance of FAH is working at full capacity on the portions of the CPU that it utilizes.

Multiple Core CPUs
CPUs with more than one core, e.g. an AMD X2 CPU or an Intel Core Duo, are seen by the operating system and by the client as separate CPUs, as opposed to HT-enabled CPUs. Therefore, utilizing more than one core of the CPU is accomplished by running multiple instances of the client, one for each CPU core.

Dual Core CPUs are broadly equivalent to two CPUs within one package.

Affinity
When an operating system detects more than one processor/CPU/core, it will enable multiprocessor support. Then, if more than one thread is ready to be processed, it will assign the highest priority threads to the processors that are ready to work. This is a dynamic process which happens very rapidly without user intervention. Operating systems generally assume that all processors have equal capabilities ("symmetric" multiprocessing, or SMP). Traditionally, FAH has been single-threaded, meaning that it can use no more than one CPU at a time. Recently, some cores have been provided that are multi-threaded and can use more than one CPU. The traditional method of folding on more than one CPU has been installing multiple FAH clients, one per CPU. This article is written assuming that you are running multiple single-threaded clients.

You can manually adjust the operating system's assignment choices by setting Affinity. In most cases this is unnecessary, but there is one exception: Intel's HyperThreading. The following text assumes Windows (NT/2000/XP/Vista/etc.); the procedure to set affinity on Linux is discussed after the Windows section.

If you have a P4/Xeon processor, it simulates two virtual processors using HyperThreading, or HT. There really is only one core in the CPU so it is impossible to increase performance very much over that of the same CPU without HT. Small increases may be obtained by changing the order that the instructions are executed by working on another thread whenever processing of the first one is blocked.

When not to use affinity
If you have a single P4/Xeon, both virtual processors have the same capabilities so it doesn't matter which one FAH is assigned to.

If you have a true multi-processor system, either with separate CPUs in different sockets or with something like the Athlon X2 or Intel's Core Duo, affinity is likewise unnecessary.

If you run one client per virtual processor or you run SMP software that uses all of your virtual processors, you don't need affinity.

When to use affinity
If you have two or more P4/Xeon CPUs running HT, you will have four or more simulated processors and you may need to use Affinity, depending on how you choose to run FAH.

With the above in mind, very few cases remain where affinity is useful. Specifically, let's assume you have a dual Xeon with four virtual processors and you run two single-threaded FAH clients. There will be two FahCore_xx processes which each use 100% of a processor and are constantly ready to run. You'll also have a variety of Windows services which use virtually no processing time, plus whatever you run in the foreground. Because of these other tasks, the two FahCores will be assigned almost randomly to any two of the virtual processors. To get maximum FAH throughput, it's important that the two FahCores do not run on the same real CPU.

When Windows starts up, it first detects two real CPUs and then recognizes that they simulate four virtual CPUs. Thus CPU0 and CPU2 represent one physical CPU, and CPU1 and CPU3 represent the other physical CPU. The ideal setting therefore is to restrict one FahCore to CPU0 and CPU2 and restrict the other FahCore to CPU1 and CPU3. It doesn't matter which "half" of the real CPU is utilized by a FahCore, as long as both FahCores are prevented from running on a single real CPU.
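The CPU pairing described above can also be written as an affinity bitmask, in which bit N stands for CPU N. This is the form that tools such as taskset accept on Linux when given a mask instead of a -c list. The following is a minimal sketch: cpu_mask is a hypothetical helper name, and the CPU numbering is the one assumed in the text, which may differ on other systems.

```shell
# cpu_mask CPU...
# Prints the hexadecimal affinity mask with one bit set per listed CPU number.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$((mask | (1 << cpu)))
    done
    printf '0x%X\n' "$mask"
}

cpu_mask 0 2   # the two virtual CPUs of the first physical CPU
cpu_mask 1 3   # the two virtual CPUs of the second physical CPU
```

With the numbering above, the two calls print 0x5 and 0xA, so the two FahCores end up with masks that share no bits.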

This can be set permanently with a program on the Windows resource disk (which few people have) called Imagecfg.exe. It can also be set dynamically through the Task Manager (taskmgr.exe). Start only one FAH client. Right-click on the FAH client (winFAH or FAH50x-Console) and select Set Affinity. Remove the check-marks from the odd-numbered CPUs. Right-click on FahCore_xx and repeat the process. Note the Process IDs (PIDs) of each task. Start the second FAH client. On the new client and new FahCore, remove the check-marks from the even-numbered CPUs.

When the current WU finishes, that FahCore will terminate and, after downloading a new WU, another one will be started. It's not necessary to modify the affinity of the new FahCore process, since it will inherit the affinity of the client that starts it. In most cases, it is impossible to set the client affinity before the first FahCore starts so you have to set it for the first one.

Setting affinity on Linux
To bind one or more F@H clients to a particular CPU on Linux, you can use the taskset program (from the schedutils package). To bind two F@H clients to particular CPUs using taskset on a dual Xeon with HyperThreading, start the clients with these commands:

 # First F@H client
 taskset -c 0 ./FAH504-Linux.exe -verbosity 9
 # Second F@H client
 taskset -c 2 ./FAH504-Linux.exe -verbosity 9
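To confirm that a binding took effect, reasonably recent Linux kernels expose the CPUs a process may run on in /proc/<pid>/status, in the Cpus_allowed_list field. The sketch below reads that field. allowed_cpus is a hypothetical helper name, and it takes the path to a status file so that it can be pointed at any process.

```shell
# allowed_cpus STATUS_FILE
# Prints the CPU list a process may run on, as recorded by the kernel
# in the Cpus_allowed_list line of /proc/<pid>/status.
allowed_cpus() {
    awk -F':[ \t]*' '/^Cpus_allowed_list/ { print $2 }' "$1"
}

# Example: show the affinity of the current shell (Linux only).
if [ -r "/proc/$$/status" ]; then
    allowed_cpus "/proc/$$/status"
fi
```

For a client started with taskset -c 0, the field should read 0 for both the client and the FahCore process it starts. Running taskset -cp with a PID reports the same information.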

The -c parameter tells taskset which CPU the process and its children should be bound to. You can see the CPU numbers available in your system in the file /proc/cpuinfo.
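The processor numbers can be pulled out of /proc/cpuinfo with a short filter. The sketch below prints each logical CPU number together with its "physical id" value when the kernel reports one (newer x86 kernels do; older ones may not, in which case a "?" is printed). list_cpus is a hypothetical helper name, and the optional argument lets it read a saved copy of the file instead.

```shell
# list_cpus [CPUINFO_FILE]
# Prints one line per logical CPU: "<processor number> <physical id or ?>".
# Reads /proc/cpuinfo unless another file in the same format is given.
list_cpus() {
    awk -F':[ \t]*' '
        /^processor/   { cpu = $2; phys[cpu] = "?"; order[n++] = cpu }
        /^physical id/ { phys[cpu] = $2 }
        END { for (i = 0; i < n; i++) print order[i], phys[order[i]] }
    ' "${1:-/proc/cpuinfo}"
}

# Example: list the CPUs of this machine (Linux only).
if [ -r /proc/cpuinfo ]; then
    list_cpus
fi
```

Two logical CPUs that show the same physical id share one package, so on an HT system it is worth checking this before choosing the numbers passed to taskset -c.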

The lines beginning with "processor" give the number by which Linux refers to that specific CPU or core.

 $ cat /proc/cpuinfo
 processor      : 0
 vendor_id      : GenuineIntel
 cpu family     : 15
 model          : 2
 model name     : Intel(R) Xeon(TM) CPU 2.66GHz
 stepping       : 5
 cpu MHz        : 2673.094
 cache size     : 512 KB
 fdiv_bug       : no
 hlt_bug        : no
 f00f_bug       : no
 coma_bug       : no
 fpu            : yes
 fpu_exception  : yes
 cpuid level    : 2
 wp             : yes
 flags          : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
 bogomips       : 5351.45

 processor      : 1
 vendor_id      : GenuineIntel
 cpu family     : 15
 model          : 2
 model name     : Intel(R) Xeon(TM) CPU 2.66GHz
 stepping       : 5
 cpu MHz        : 2673.094
 cache size     : 512 KB
 fdiv_bug       : no
 hlt_bug        : no
 f00f_bug       : no
 coma_bug       : no
 fpu            : yes
 fpu_exception  : yes
 cpuid level    : 2
 wp             : yes
 flags          : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
 bogomips       : 5345.96

 processor      : 2
 vendor_id      : GenuineIntel
 cpu family     : 15
 model          : 2
 model name     : Intel(R) Xeon(TM) CPU 2.66GHz
 stepping       : 5
 cpu MHz        : 2673.094
 cache size     : 512 KB
 fdiv_bug       : no
 hlt_bug        : no
 f00f_bug       : no
 coma_bug       : no
 fpu            : yes
 fpu_exception  : yes
 cpuid level    : 2
 wp             : yes
 flags          : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
 bogomips       : 5345.86

 processor      : 3
 vendor_id      : GenuineIntel
 cpu family     : 15
 model          : 2
 model name     : Intel(R) Xeon(TM) CPU 2.66GHz
 stepping       : 5
 cpu MHz        : 2673.094
 cache size     : 512 KB
 fdiv_bug       : no
 hlt_bug        : no
 f00f_bug       : no
 coma_bug       : no
 fpu            : yes
 fpu_exception  : yes
 cpuid level    : 2
 wp             : yes
 flags          : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
 bogomips       : 5345.92

For more information on using affinity on Linux, see:

Running Multiple Clients
When running multiple clients to utilize more than one CPU and/or CPU core, it is important to run each client from its own directory, which is not shared with other clients. Otherwise one client will overwrite the files of another, which will result in lost work.

It is also important to give each client its own Machine ID, which can best be done with the No-Nonsense Console client. The console client is recommended because the Machine ID can be set to a value other than 1. The Graphical (GUI) client for Windows is hard-coded to use a Machine ID of 1. The Machine ID can be set by the console client in the Advanced configuration section of the client setup.

Because of the Machine ID limitation of the GUI client, you can run only one GUI client in a multiple client scenario. All the other clients have to be console clients.

Note: The Windows GUI client (V5 and earlier) or the Windows Systray client (V6) is delivered as a Windows installer. This is NOT true for the Console clients, where only the executable is delivered. When you download the Console client, you are responsible for putting it in whatever directory you want it to reside and for creating whatever shortcuts you wish to use with it.

On Windows (V5 and earlier) you also have to add the -local command line switch to the target line of the shortcut which starts each Console client. Otherwise multiple clients will not run correctly. In V6, -local is the default, and the switch is ignored if used. See also: How do I add flags using a shortcut to the console client?

The following scenarios show some examples of how the clients should be configured to utilize all the available CPUs/CPU cores. See also: How do I reconfigure the console client options? and How tos in general for configuration tips and tricks.

Example A - 1 Graphical client and 1 Console client
Environment: The Graphical client is installed in The Console client is installed in

Configuration: The GUI client is already configured with its Machine ID set to 1, so the Console client needs to be configured with its Machine ID set to a value higher than 1 and at maximum 8. The most logical choice would be to simply increment the Machine ID by one for each additional client, so we configure the Console client to set its Machine ID to 2. The target line for the shortcut to start the Console client would contain something like:

Example B - 2 or more Console Clients
Environment: The first Console client is installed in The second Console client is installed in

Configuration: The first Console client will be configured to set its Machine ID to 1, and the second Console client will be configured to set its Machine ID to 2. Any additional clients should get a further incremented Machine ID each time, up to 8. The target line for the shortcut to start the first Console client would contain something like: and the target line for the shortcut to start the second Console client: .

The above scenario using multiple Console clients is almost identical to the way it is done with the Linux and Mac OS X clients, but you can skip usage of the parameter which is only used on Windows. Simply configure each client with a unique Machine ID in their own directories, using the -local switch, and you're all set.

Note: This process can be scaled up to 8 console clients in Windows, continuing with Machine IDs 3 through 8, if you have the CPU cores to match. High performance clients like the GPU or SMP client can be configured up to 16 Machine IDs, but you almost never need more than one of them.
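The layout described in this section, one directory and one Machine ID per client with at most 8 clients, can be sketched as a small planning helper. This is only an illustration: plan_clients and the fah1, fah2, ... directory names are hypothetical, and the Machine ID itself still has to be entered in each client's own configuration.

```shell
# plan_clients COUNT
# Prints one "<directory> <machine id>" line per client.
# Refuses counts outside 1-8, the Machine ID range of the standard client.
plan_clients() {
    count=$1
    if [ "$count" -lt 1 ] || [ "$count" -gt 8 ]; then
        echo "client count must be between 1 and 8" >&2
        return 1
    fi
    i=1
    while [ "$i" -le "$count" ]; do
        echo "fah$i $i"
        i=$((i + 1))
    done
}

# Example: plan for a dual-core machine.
plan_clients 2
```

The output can be fed to a loop such as `plan_clients 2 | while read -r dir id; do mkdir -p "$dir"; done` to create the directories before copying a client executable into each one.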

Related Info
Vijay Pande, the project lead, made a statement in the old folding forum not to run 2 clients on P4 systems with HT. The quote is posted here for clarity.

There are multiple issues being discussed here.

1) One is the fact that running 2 procs on an HT CPU leads to 2 procs which are each ~60-65% the speed of the original CPU.

2) the other is that there are lots of WUs taken out

''As for #1, please keep in mind that we're running kinetics and there are some calcs which need to reach a certain minimum number of generations before they're useful. Let's say that min number is 10. Having a million RUN,CLONE's reach 5 gens would be useless (and a big waste of CPUs). Having even 50,000 reach past 10 gens would be fantastic. This is different than other d.c. projects (which just hand out work which can be done in any order). 10 gen minimums is not always the case, but does come up more and more these days as we go to more complex systems. Thus, this hasn't been a huge limiting problem in the past, but is becoming a greater and greater issue that I'm watching.''

''In the case of running 2 procs on HT, one slows down the return of each WU, even though 2 are returned. This could still be a disaster for FAH since, while we're getting 2x more CPUs, we're getting slower CPUs, which might make the WUs useless. Let's put it another way, how much would a gamer or a scientist pay for a 7 GHz P4? If they existed, they would cost a lot more than two 3.5GHz P4's. It's a waste to then turn around and just run 2 procs on that 7 GHz machine.''

''Anyway, Riz does make a good point that fast P4's can run 2 HT procs and will be within deadlines. I should stress that I can see the logic for why to do it and this is not "cheating" either in the letter or spirit of FAH stats. However, the logic for doing this is based not on science, but on stats. In this case, what's good for stats can be non-ideal for science. That makes me think that the stats should be tweaked to reward people for returning WUs in a means that's best for science.''

I agree that #2 is unrelated to having lots of WUs out, since each HT proc would appear as another CPU.

Links
 * How to set an application's CPU affinity permanently