Runs, Clones and Gens

From FaHWiki

An explanation by Dan Ensign

First, let's review some basic physics. The key idea is that of a "trajectory." You might recall Newton's Second Law, F = ma, which says that the force F on a particle equals its mass m times its acceleration a (the rate of change of its velocity). This means that if we can catalog all the forces on a particle, we can determine its acceleration. If we know the acceleration, plus where the particle started and how fast it was moving, then we can use calculus to determine the particle's position as a function of time, for all time. The result is what's called a "trajectory" -- a kind of map of where the particle has been and where it will be going. By the way, when I say "particle," I mean that we could perform this analysis on atoms, protein molecules, baseballs, the space shuttle, the Sun, or anything in between.

The analysis gets a lot harder the more particles there are in the system. For instance, if you set up a system with the Earth and the Sun as two particles, experiencing each other's gravity, then you can solve Newton's Second Law very easily and write down a function which describes the position of the Sun and the Earth at all times. If you include the moon or other planets, then you can't write down functions like this, though you can solve Newton II numerically. This is what we do for FAH -- solve Newton II numerically for thousands of atoms, thousands of times, once every femtosecond or so of simulated time (that's "ten-to-the-minus-15" seconds). What we get is a trajectory for the protein atoms.
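Here's a minimal sketch of what "solving Newton II numerically" looks like, for one particle on a spring instead of thousands of atoms. It uses the velocity Verlet scheme, which is a common choice in molecular dynamics; that FAH's simulation codes use exactly this scheme is an assumption on my part, and the function name and toy force are made up for illustration.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Step F = m*a forward in time for one particle in one dimension.

    Returns the trajectory as a list of (time, position, velocity) tuples.
    """
    trajectory = [(0.0, x, v)]
    a = force(x) / mass                      # Newton II: a = F / m
    for step in range(1, n_steps + 1):
        x = x + v * dt + 0.5 * a * dt * dt   # advance the position
        a_new = force(x) / mass              # forces at the new position
        v = v + 0.5 * (a + a_new) * dt       # advance the velocity
        a = a_new
        trajectory.append((step * dt, x, v))
    return trajectory


# Toy system (hypothetical): unit mass on a spring, F = -k * x.
k = 1.0
traj = velocity_verlet(x=1.0, v=0.0, force=lambda pos: -k * pos,
                       mass=1.0, dt=0.01, n_steps=1000)
t_end, x_end, v_end = traj[-1]
```

The list that comes back is the trajectory: where the particle was at every step. A real simulation does the same bookkeeping for every atom in three dimensions, with much messier forces.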

If we're simulating protein folding, then perhaps the trajectory will result in a folded protein. Perhaps not -- we don't have a way to say for sure how this happens for an arbitrary starting conformation. (But we're studying it, obviously, thanks to our Army of Undea -- oops, I mean FAH clients. The Army of Undead is for a different project entirely.)

Now, on my desktop machine at work, I can simulate a system of about 16,000 atoms moving for 1 nanosecond (ns, or "ten-to-the-minus-9" seconds) in one day. But the protein that I'm folding requires (on average) one microsecond ("ten-to-the-minus-6" seconds, or 1,000 ns) to fold -- and this is a system engineered to fold fast. To get to one microsecond on my desktop machine, I'd have to simulate for 1,000 days. Forget about "average" proteins, which might take hundreds of microseconds, or milliseconds, to fold.
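The arithmetic above is easy to check. This sketch uses only the numbers quoted in the text (1 ns per day, a 1 microsecond fold, one step per femtosecond "or so"); the function names are made up for illustration.

```python
NS_PER_DAY = 1.0        # desktop throughput quoted above
FOLD_TIME_NS = 1_000.0  # 1 microsecond = 1,000 ns
TIMESTEP_FS = 1.0       # "once every femtosecond or so"
FS_PER_NS = 1_000_000   # 1 ns = 10^6 fs


def days_to_simulate(target_ns, ns_per_day=NS_PER_DAY):
    """Wall-clock days needed to simulate target_ns on one machine."""
    return target_ns / ns_per_day


def steps_needed(target_ns, timestep_fs=TIMESTEP_FS):
    """Number of integration steps needed to cover target_ns."""
    return round(target_ns * FS_PER_NS / timestep_fs)


days = days_to_simulate(FOLD_TIME_NS)   # 1000.0 days
steps = steps_needed(FOLD_TIME_NS)      # 1,000,000,000 steps
```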

Maybe I'd get lucky and the protein would fold in that time; maybe I wouldn't, and they'd find me 35 years later, in some sub-subbasement below the chemistry building at Stanford, a raving lunatic lost to the dregs of Ph.D. research, sneaking out only at night to feed on spilled yeast extract and collecting discarded NMR tubes to wear as primitive jewelry. (I heard this happened to a guy.)

To avoid life-wasting tragedy, we (and when I say "we" I mean, "Someone besides me, but who I know") have recruited hundreds of thousands of generous and interested persons ("you guys") to give us a hand with some of this work. I could run a trajectory for 1,000 days, but instead we've taken a shortcut and decided to run 1,000 or 10,000 or 100,000 trajectories for a few days (or months or years) instead. On average, a few of these trajectories will result in a folded protein (and we have ways of extracting interesting and important information from all of the work done on FAH).
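To see why lots of short trajectories can stand in for one long one, here's a rough estimate. It assumes folding behaves like a simple first-order process, so the chance that a trajectory of length t has folded is 1 - exp(-t / tau), where tau is the average folding time. That model is my assumption for the sketch, not something established above, and the 20 ns trajectory length is a made-up example.

```python
import math


def fraction_folded(traj_ns, mean_fold_ns):
    """Chance that one trajectory folds within traj_ns (first-order model)."""
    return 1.0 - math.exp(-traj_ns / mean_fold_ns)


def expected_folded(n_traj, traj_ns, mean_fold_ns):
    """Average number of folded trajectories out of n_traj."""
    return n_traj * fraction_folded(traj_ns, mean_fold_ns)


# Hypothetical numbers: 10,000 trajectories of 20 ns each,
# for a protein that takes 1,000 ns (1 microsecond) on average to fold.
n_expected = expected_folded(10_000, 20.0, 1_000.0)   # about 198
```

Under this model any single 20 ns trajectory only has about a 2% chance of folding, but with 10,000 of them running in parallel you'd expect a couple of hundred to make it.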

Okay, here it is: The CLONE numbers are labels for each trajectory that we run. Each GENeration is another chunk of time along that trajectory. So, say that I benchmark CLONE0, GEN0 (the first 4 ns). That WU is then done, and the FAH software builds a new WU with starting coordinates (and velocities and stuff) where mine left off. Then the new WU -- GEN1 of CLONE0 -- gets sent to you, and you simulate the next 4 ns. And so on. So CLONE is a label for an individual trajectory, and GENerations are consecutive chunks of time along that trajectory.
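As a sketch, the CLONE/GEN bookkeeping looks something like this. The `WorkUnit` class and `next_work_unit` function are hypothetical names for illustration, not the actual FAH server code, and the 4 ns chunk length is just the one from the example above.

```python
from dataclasses import dataclass

GEN_LENGTH_NS = 4.0  # chunk length from the example above


@dataclass(frozen=True)
class WorkUnit:
    clone: int          # which trajectory this WU belongs to
    gen: int            # which chunk of time along that trajectory
    start_state: tuple  # stand-in for coordinates, velocities "and stuff"

    @property
    def start_time_ns(self):
        """Simulated time at which this WU picks up the trajectory."""
        return self.gen * GEN_LENGTH_NS


def next_work_unit(finished, end_state):
    """Build GEN n+1 of the same CLONE, starting where GEN n left off."""
    return WorkUnit(clone=finished.clone, gen=finished.gen + 1,
                    start_state=end_state)


gen0 = WorkUnit(clone=0, gen=0, start_state=("coords_0", "vels_0"))
gen1 = next_work_unit(gen0, end_state=("coords_after_4ns", "vels_after_4ns"))
```

The CLONE number never changes along the chain; only the GEN number and the starting state do.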

RUNs are groups of similar CLONEs. All the CLONEs in a RUN have the exact same atoms, the exact same atom positions, the same temperature, etc. The difference is the starting velocities -- the initial motions of all the atoms in the protein are randomized. Although statistically the velocities are determined by the temperature, there are countless ways of partitioning the velocities to the atoms, so we try out 100 or so CLONEs to get a good feel for the sample space. Assigning different velocity sets to the atoms turns out to be wildly important: if the conformation we start with happens to represent the transition state (sort of halfway from folded, halfway from unfolded) then about 50 of our 100 CLONEs will fold, and about 50 won't.
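Here's a sketch of how the CLONEs in a RUN could get different starting velocities at the same temperature. It draws each velocity component from a Gaussian with standard deviation sqrt(kB * T / m), which is the standard Maxwell-Boltzmann recipe; whether FAH's setup tools do precisely this, and the use of the CLONE number as the random seed, are assumptions for illustration.

```python
import math
import random

K_B = 1.380649e-23       # Boltzmann constant, J/K
AMU = 1.66053906660e-27  # atomic mass unit, kg


def random_velocities(masses_amu, temperature_k, seed):
    """One (vx, vy, vz) tuple in m/s per atom, drawn at the given temperature."""
    rng = random.Random(seed)
    velocities = []
    for m_amu in masses_amu:
        sigma = math.sqrt(K_B * temperature_k / (m_amu * AMU))
        velocities.append(tuple(rng.gauss(0.0, sigma) for _ in range(3)))
    return velocities


# Same atoms, same temperature, 100 different velocity sets: one RUN's CLONEs.
masses = [12.011, 1.008, 14.007, 15.999]   # toy "protein": C, H, N, O
clone_velocities = [random_velocities(masses, 300.0, seed=clone)
                    for clone in range(100)]
```

Every CLONE sees the same temperature on average, but no two start out with the atoms moving the same way.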

The different RUNs in a PROJect might, in their simplest form, represent different starting conformations. So, we could start off 100 RUNs of different partially unfolded structures and try to find the one for which half of its CLONEs fold -- then that RUN's starting conformation is a representative of the transition state.
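The "find the RUN where half the CLONEs fold" search can be sketched like this. The outcome data here is invented, and deciding whether a real trajectory has "folded" is a research problem of its own that this sketch skips entirely.

```python
def fold_fraction(clone_outcomes):
    """Fraction of a RUN's CLONEs that folded (True) rather than not (False)."""
    return sum(clone_outcomes) / len(clone_outcomes)


def closest_to_transition_state(runs):
    """Return the RUN number whose fold fraction is nearest one half.

    runs maps RUN number -> list of per-CLONE outcomes (True = folded).
    """
    return min(runs, key=lambda run: abs(fold_fraction(runs[run]) - 0.5))


# Hypothetical outcomes for three RUNs of 100 CLONEs each.
runs = {
    0: [True] * 90 + [False] * 10,   # mostly folds: already past the hump
    1: [True] * 52 + [False] * 48,   # about half and half
    2: [True] * 5 + [False] * 95,    # rarely folds: still far from it
}
best_run = closest_to_transition_state(runs)   # RUN 1
```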

So why is this transition state doohickey so important? The folded state is relatively easy to identify, especially if experimentalists have determined the structure for the protein under scrutiny, or for a very similar one. The "unfolded state" is a bit harder, but we can generate unfolded conformations by, say, simulating the folded protein at high temperatures so it "melts," or we can thread the amino acid sequence on a set of randomly coiled noodles, or whatever. The path which connects "unfolded" protein with folded protein is not so easy to get to. But if we identify the transition state, then we've found (at least one of) the paths by which proteins fold, and that's research in protein folding.

The RUNs might also represent slightly different proteins -- for instance, different mutants of some protein. They might represent other things that I haven't thought of, but whatever they are, they are similar enough to the other RUNs in the same PROJect that, well, they're part of the same project.

So to summarize, when I'm setting up a project, I might do the following:

  1. Pick 100 different unfolded or partially unfolded conformations of my protein of interest. These become my RUNs.
  2. Then, I set up 100 different CLONEs for each RUN. (Well, I don't actually set them up myself, I just run a program. But I run it really well. And intelligently. And I look good doing it.) Each CLONE contains one WU at this point.
  3. Then, I let the (100 RUNs) x (100 CLONEs) = 10,000 WUs loose on the world ("you guys").
  4. Then, I go have lunch.
  5. I come back weeks later to find WUs crunched and GENerations progressing -- each of the original 10,000 WUs was the beginning of one trajectory, so at the end, I have 10,000 trajectories of 50 or 100 or more ns.
  6. Finally, I sift through the data and learn something new about protein folding!
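Steps 1 through 3 and the count in step 5 can be sketched as below. The (run, clone, gen) tuple is just a convenient label for this example, and the 4 ns per GENeration is borrowed from the earlier CLONE0/GEN0 example rather than being a fixed property of every project.

```python
from itertools import product

N_RUNS = 100     # starting conformations (step 1)
N_CLONES = 100   # velocity sets per conformation (step 2)


def initial_work_units(n_runs=N_RUNS, n_clones=N_CLONES):
    """One GEN 0 work unit, labeled (run, clone, gen), per RUN/CLONE pair."""
    return [(run, clone, 0)
            for run, clone in product(range(n_runs), range(n_clones))]


def trajectory_length_ns(gens_done, gen_length_ns=4.0):
    """Simulated time covered once gens_done GENerations have come back."""
    return gens_done * gen_length_ns


work_units = initial_work_units()    # step 3: let these loose on the world
n_trajectories = len(work_units)     # 10,000 trajectories
```

Each of those 10,000 labels is the start of one trajectory; with 4 ns per GENeration, 25 returned GENerations puts a trajectory at 100 ns.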

And so it goes. I'm still new at this, so I haven't actually done steps 4, 5, or 6 yet, but I've got a good handle on 1, 2, and 3, and now it's a matter of waiting (and doing 1, 2, and 3 a lot more).

Bruce has just correctly pointed out to me that the picture above, in which the next GENeration of a CLONE gets built as soon as the previous one comes back, isn't always true (although it's true nearly all of the time). In some instances -- when different trajectories are made to interact -- the "next generation" can't be built until all the other CLONEs have returned WUs of the same generation.

This happens for instance when doing "Replica Exchange Molecular Dynamics," for which the different CLONEs would be trajectories run at different temperatures (at least I think this is how it works ...). Sometimes, the atom coordinates between different trajectories need to be swapped in REMD, and hence you need to wait for the CLONEs to all have generation n finished to build GEN n+1 WUs.

I think.

Try (hope I got that right). In the end, AIUI, doing REMD with FAH is a pain compared to just doing it on a supercomputer -- we'd rather use FAH for its strengths ("a freaking lot of processors").
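The wait-for-everyone rule for interacting trajectories can be sketched as a simple check. This only shows the bookkeeping (has every CLONE returned generation n?), not the actual exchange step, and the function name is made up for illustration.

```python
def ready_for_next_gen(returned, n_clones, gen):
    """True only if every CLONE has returned its WU for this generation.

    returned is a set of (clone, gen) pairs for WUs that have come back.
    """
    return all((clone, gen) in returned for clone in range(n_clones))


# Hypothetical state: three CLONEs, two of which have finished GEN 3.
returned = {(0, 3), (1, 3), (2, 2)}
blocked = not ready_for_next_gen(returned, n_clones=3, gen=3)   # still waiting

returned.add((2, 3))                                            # straggler returns
unblocked = ready_for_next_gen(returned, n_clones=3, gen=3)     # now GEN 4 can be built
```

One slow machine holds up every CLONE in the group, which is part of why this kind of run is less comfortable on FAH than independent trajectories are.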

Technical summary...

  • Project # is the numerical designation for the initial set of work unit parameters. Project numbers are not repeated (although they have been in the past, by mistake). They are also numerically grouped by researcher / field of study.
  • Run is a numeric designation for a grouping of Clones with identical work unit attributes (the exact same atoms, the exact same atom positions, the same temperature) that differ only in their randomized starting velocities.
  • Clone is a numeric designation for each trajectory, and "trajectory" is the very technical thing here. See the first paragraph above.
  • Generation is a numeric designation for a pre-determined length of time along a trajectory (Clone) for a specific project.
