The file "analyze.cfg" is used to set up avida when it is in analysis-only mode, which can be started by running "avida -a". Analysis mode is used to perform additional tests on genotypes after a run has completed.
This analysis language is basically a simple programming language. The structure of a program involves loading in genotypes in one or more "batches", and then either manipulating single batches, or doing comparisons between batches. Currently there can be up to 300 batches of genotypes, but we will eventually remove this limit.
The rest of this file describes how individual commands work, as well as some notes on other language features, like how to use variables. As a formatting guide, command arguments will be presented between brackets, such as [filename]. If that argument is mandatory, it will be in blue. If it is optional, it will be in green, and (if relevant) a default value will be listed, such as [filename="output.dat"].
There are currently four ways to load in genotypes:
Table 1: Genotype Loading Commands
LOAD_ORGANISM [filename] Load in a normal single-organism file of the type that is output from avida. These consist of lots of organismal information inside of comments, and then the full genome of the organism with one instruction per line. |
LOAD_BASE_DUMP [filename] Load in a basic dump file from avida. Each line contains a genotype sequence, but little additional information. |
LOAD_DETAIL_DUMP [filename] Load in a detail file. These are similar to the basic dump files, but contain a lot more information on each line. These files are saved from avida typically beginning with the word "detail" or "historic". |
LOAD_SEQUENCE [sequence] Load in a user-provided sequence as the genotype. Avida has a symbol associated with each instruction; this command is simply followed by a sequence of such symbols that is then translated back into a proper genotype. |
A future addition to this list is a command that will use the "dominant.dat" file to identify all of the dominant genotypes from a run, and then look up and load their individual genomes from the genebank directory. Also, the commands LOAD_BASE_DUMP and LOAD_DETAIL_DUMP currently require fixed-format files. New output files from avida have tags listed for their column names, and as such we already have a working prototype of a generic "LOAD" command that will figure out the file format and be able to load in all of the data properly.
All of the load commands place the new genotypes into the "current" batch, which can be set with the "SET_BATCH" command. Below is the list of control functions that allow you to manipulate the batches.
Table 2: Batch Control Commands
SET_BATCH [id] Set the batch that is currently active; the initial active batch at the start of a program is 0. |
NAME_BATCH [name] Attach a name to the current batch. Some of the printing methods will print data from multiple batches, and we want the data from each batch to be attached to a meaningful identifier. |
PURGE_BATCH [id=current] Remove all genotypes in the specified batch (if no argument is given, the current batch is purged). |
DUPLICATE [id1] [id2=current] Copy the genotypes from batch id1 into id2. By default, copy id1 into the current batch. Note that DUPLICATE does not clear the target batch, so you should purge the target batch first if you don't want to just add more genotypes to the ones already in that batch. |
STATUS Print out (to the screen) the genotype count of each non-empty batch and identify the currently active batch. |
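As a rough sketch of how the loading and batch control commands fit together, the fragment below loads two populations into separate named batches and then copies one of them into a third batch. The file names are hypothetical, and the use of "#" to start a comment line is an assumption based on how other avida configuration files are written.

```
# Load an early population into batch 0 (hypothetical file name).
SET_BATCH 0
LOAD_DETAIL_DUMP detail_pop.10000
NAME_BATCH early

# Load a later population into batch 1 (hypothetical file name).
SET_BATCH 1
LOAD_DETAIL_DUMP detail_pop.100000
NAME_BATCH late

# Copy batch 0 into batch 2, purging the target first so that
# the copy does not add to genotypes that are already there.
PURGE_BATCH 2
DUPLICATE 0 2

# Report the genotype count of each non-empty batch.
STATUS
```

Since the second argument of DUPLICATE defaults to the current batch, "DUPLICATE 0" issued while batch 2 is active should have the same effect as the explicit form used here.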
There are several other commands that will allow you to interact with the analysis mode in some very important ways, but don't actually trigger any analysis tests or output. Below is a list of some of the more important control commands.
Table 3: More Analysis Control Commands
VERBOSE Toggle verbose/minimal messages. Verbose messages will print all of the details of what is happening to the screen. Minimal messages will only briefly state the process being run. Verbose messages are recommended if you're in interactive mode. |
SYSTEM [command] Run the given command on the system command line. This is particularly useful if you need to unzip files before you can use them, or if you want to delete files no longer in use. |
INCLUDE [filename] Include another file into this one and run its contents immediately. This is useful if you have some pre-written routines that you want to have available in several analysis files. Watch out because there are currently no protections against circular includes. |
INTERACTIVE Place Avida analysis into interactive mode so that you can type commands and have them immediately acted upon. You can place this anywhere within the analyze file, so that you can have some processing done before interactive mode starts. You can type "quit" at any point to continue with the normal processing of the file. |
DEBUG [message] This is an "echo" command that will print a message (its arguments) on the screen. If there are any variables (see below) in the message, they will be translated before printing, so this is a good way of debugging your programs. |
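The control commands above might be combined as in the following sketch. The compressed file, the included file name, and the choice of gunzip as the unzip tool are all hypothetical, and the "#" comment lines are an assumption about the file syntax. SET is one of the variable commands described later in this file.

```
# Print full details of each step to the screen.
VERBOSE

# Uncompress a saved population before loading it (hypothetical file).
SYSTEM gunzip detail_pop.100000.gz

# Pull in some pre-written routines (hypothetical file).
INCLUDE my_routines.cfg

# Variables in a DEBUG message are translated before printing.
SET d detail_pop.100000
DEBUG About to load $d
LOAD_DETAIL_DUMP $d

# Hand control to the user; typing "quit" resumes the rest of the file.
INTERACTIVE
```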
Now that we know how to interact with analysis mode and load in genotypes, it's important to be able to manipulate them. The next batch of commands will do basic analysis on genotypes, and allow the user to prune batches to only include those genotypes that are needed.
Table 4: Genotype Manipulation Commands
RECALCULATE Run all of the genotypes in the current batch through a test CPU and record the measurements taken (fitness, gestation time, etc.). This overrides any values that may have been loaded in with the genotypes. |
FIND_GENOTYPE [type="num_cpus" ...] Remove all genotypes but the one selected. Type indicates which genotype to choose. Options available for type are "num_cpus" (to choose the genotype with the maximum organismal abundance at time of printing), "total_cpus" (number of organisms ever of this genotype), "fitness", or "merit". If the type entered is numerical, it is used as an id number to indicate the desired genotype (if no such id exists, a warning will be given). Multiple arguments can be given to this command, in which case all of the genotypes in that list will be preserved and the remainder deleted. |
FIND_LINEAGE [type="num_cpus"] Delete everything except the lineage from the chosen genotype back to the most distant ancestor available. This command will only function properly if parental information was loaded in with the genotypes. Type is the same as for the FIND_GENOTYPE command. |
ALIGN Create an alignment of all the genome sequences; it will place '_'s in the sequences to show the alignment. Note that a "FIND_LINEAGE" must first be run on the batch in order for the alignment to be possible. |
SAMPLE_ORGANISMS [fraction] Keep only "fraction" of organisms in the current batch. This is done per organism, not per genotype. Thus, genotypes of high abundance may only have their abundance lowered, while genotypes of abundance 1 will either stay or be removed entirely. |
SAMPLE_GENOTYPES [fraction] Keep only "fraction" of the genotypes in the current batch. |
RENAME [start_id=0] Change the id numbers of all the genotypes to start at a given value. Often in long runs we will be dealing with IDs in the millions. In particular, after reducing a batch to a lineage, we will often want to number the genotypes in order from the ancestor to the final one. |
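A minimal sketch of pruning a batch with the commands in Table 4 follows. The file name and the sampling fraction are hypothetical, and the "#" comment lines are an assumption about the file syntax.

```
# Load a population (hypothetical file name).
LOAD_DETAIL_DUMP detail_pop.100000

# Keep roughly a tenth of the organisms; abundant genotypes mostly
# survive with reduced counts, while rare ones may vanish entirely.
SAMPLE_ORGANISMS 0.1

# Measure fitness, gestation time, etc. in a test CPU.
RECALCULATE

# Keep only the most abundant genotype and the fittest genotype.
FIND_GENOTYPE num_cpus fitness

# Renumber the survivors starting from 0.
RENAME 0
```

Sampling before RECALCULATE is a design choice made here so that only the genotypes that remain have to be run through the test CPU.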
Next, we are going to look at the standard output commands that are used to save information generated in analyze mode.
Table 5: Basic Output Commands
PRINT [dir="genebank/"] Print the genotypes from the current batch as individual files (one genotype per file) in the directory given. The files will be named by the genotype name, with a ".gen" appended to them. |
TRACE [dir="genebank/"] Trace all of the genotypes and print a listing of their execution. This will show step-by-step the status of all of the CPU components and the genome during the course of the execution. The filename used for each trace will be the genotype's name with a ".trace" appended. |
PRINT_TASKS [file="tasks.dat"] This will print out the tasks doable by each genotype, one per line in the output file specified. Note that this information must either have been loaded in, or a RECALCULATE must have been run to collect it. |
DETAIL [file="detail.dat"] [format ...] Print out all of the stats for each genotype, one per line. The format indicates the layout of columns in the file. If the filename specified ends in ".html", html formatting will be used instead of plain text. For the format, see the section on "Output Formats" below. |
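The output commands could be used together roughly as follows. The input file, output file names, and the "dominant/" directory are hypothetical; whether the directory must already exist is not stated here, so treat that as something to check. The "#" comment lines are an assumption about the file syntax.

```
# Reduce the batch to its most abundant genotype and measure it.
LOAD_DETAIL_DUMP detail_pop.100000
FIND_GENOTYPE num_cpus
RECALCULATE

# One ".gen" file and one ".trace" file per genotype.
PRINT dominant/
TRACE dominant/

# Tasks performed by each genotype, one genotype per line.
PRINT_TASKS dominant_tasks.dat

# The ".html" ending requests html formatting instead of plain text.
DETAIL dominant.html id length fitness merit gest_time
```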
And at last, we have the actual analysis commands that perform tests on the data and output the results.
Table 6: Analysis Commands
LANDSCAPE [file="landscape.dat"] [dist=1] For each genotype in the current batch, test all possible mutations (or combinations of mutations if dist > 1) and summarize the results, one per line in the specified file. |
MAP_TASKS [dir="phenotype/"] [flags ...] [format ...] Construct a genotype-phenotype array for each genotype in the current batch. The format is the list of stats that you want to include as columns in the array. Additionally you can have special format flags; the possible flags are "html" to print output in HTML format, and "link_maps" to create html links between consecutive genotypes in a lineage. |
MAP_MUTATIONS [dir="mutations/"] [flags ...] Construct a genome-mutation array for each genotype in the current batch. The format has each line in the genome as a row in the chart, and all available instructions representing the columns. Each cell in the chart indicates the fitness that would result if the instruction at that row's position were mutated to that column's instruction. If the "html" flag is used, the charts will be output in HTML format. |
AVERAGE_MODULARITY [file="modularity.dat"] [task.0 task.1 task.2 task.3 task.4 task.5 task.6 task.7 task.8] Calculate several modularity measures, such as how many tasks an instruction is involved in, the number of sites required for each task, etc. The measures are averaged over all the organisms in the current batch that perform any tasks. For the full output list, do "AVERAGE_MODULARITY legend.dat". At the moment this command doesn't support html output format and works with only 1 and 2 input tasks. |
HAMMING [file="hamming.dat"] [b1=current] [b2=b1] Calculate the Hamming distance between batches b1 and b2. If only one batch is given, calculations are on all pairs within that batch. |
LEVENSTEIN [file="lev.dat"] [b1=current] [b2=b1] Calculate the Levenshtein distance (edit distance) between batches b1 and b2. This metric is similar to Hamming distance, but calculates the minimum number of single insertions, deletions, and mutations to move from one sequence to the other. |
SPECIES [file="species.dat"] [b1=current] [b2=b1] Again this is similar to Hamming distance, but calculates if genotypes would be considered the same species. Output: Batch1Name Batch2Name AveDistance Count FailCount |
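As an illustration of the batch arguments taken by the comparison commands, the sketch below compares two populations and then builds a landscape for one genotype. All file names are hypothetical, and the "#" comment lines are an assumption about the file syntax.

```
# Two populations in two named batches (hypothetical file names).
SET_BATCH 0
LOAD_DETAIL_DUMP detail_pop.10000
NAME_BATCH early
SET_BATCH 1
LOAD_DETAIL_DUMP detail_pop.100000
NAME_BATCH late

# One batch given: all pairs within batch 1.
HAMMING hamming_late.dat 1

# Two batches given: batch 0 against batch 1.
HAMMING hamming_early_late.dat 0 1
LEVENSTEIN lev_early_late.dat 0 1
SPECIES species_early_late.dat 0 1

# Test all single mutations of the most abundant genotype in batch 1.
FIND_GENOTYPE num_cpus
LANDSCAPE landscape_late.dat 1
```

The comparisons are run before FIND_GENOTYPE on purpose, because FIND_GENOTYPE removes every other genotype from the current batch.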
Several commands (such as DETAIL and MAP_TASKS) require format parameters to specify what genotypic features should be output. Before such commands are used, other collection functions may need to be run.
Allowable formats after a normal load (assuming these values were available from the input file to be loaded in) are:
id (Genome ID) | parent_id (Parent ID) | num_cpus (Number of CPUs) |
total_cpus (Total CPUs Ever) | length (Genome Length) | update_born (Update Born) |
update_dead (Update Dead) | depth (Tree Depth) | sequence (Genome Sequence) |
After a RECALCULATE, the additional formats become available:
viable (Is Viable [0/1]) | copy_length (Copied Length) | exe_length (Executed Length) |
merit (Merit) | comp_merit (Computational Merit) | gest_time (Gestation Time) |
efficiency (Replication Efficiency) | fitness (Fitness) | div_type (Divide type used; 1 is default) |
If a FIND_LINEAGE was done before the RECALCULATE, the parent genotype for each regular genotype will be available, enabling the additional formats:
parent_dist (Parent Distance) | comp_merit_ratio (Computational Merit Ratio with parent) |
efficiency_ratio (Replication Efficiency Ratio with parent) | fitness_ratio (Fitness Ratio with parent) |
parent_muts (Mutations from Parent) | html.sequence (Genome Sequence in Color; html format) |
Finally, if an ALIGN is run, one additional format is available: alignment (Aligned Sequence)
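The order in which the collection commands are run determines which formats can be requested. A sketch of a sequence that should make every format listed above available is shown here; the file names are hypothetical, the input file is assumed to contain parental information, and the "#" comment lines are an assumption about the file syntax.

```
# Formats such as id, parent_id, depth and sequence come from the load.
LOAD_DETAIL_DUMP detail_pop.100000

# Reduce to a lineage first so that parents are known...
FIND_LINEAGE num_cpus

# ...then measure, which adds fitness, merit, etc. and the
# parent-relative formats such as fitness_ratio and parent_dist.
RECALCULATE

# Adds the alignment format.
ALIGN

# Plain-text table of the lineage.
DETAIL lineage.dat id parent_id depth fitness fitness_ratio parent_dist alignment

# html version using the colored sequence format.
DETAIL lineage.html id fitness parent_muts html.sequence
```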
For the moment, all variables can only be a single character (letter or number) and begin with a $ whenever they need to be translated to their value. Lowercase letters are global variables, capital letters are local to a function (described later), and numbers are arguments to a function. A $$ will act as a single dollar sign, if needed.
Table 7: Variable-Related Commands
SET [variable] [value] Sets the variable to the value... |
FOREACH [variable] [value] [value ...] Set the variable to each of the values listed, and run the code that follows between here and the next END command once for each of those values. |
FORRANGE [variable] [min_value] [max_value] [step_value=1] Set the variable to each of the values between min and max (at steps given), and run the code that follows between here and the next END command, once for each of those values. |
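A short sketch of the variable commands follows. The directory layout (run_1/ through run_5/), the file names, and the update numbers are hypothetical, and the "#" comment lines are an assumption about the file syntax. Lowercase single-character names are used because these are global variables.

```
# Process five runs stored in run_1/ ... run_5/ (hypothetical layout).
FORRANGE i 1 5
  DEBUG Processing run $i
  PURGE_BATCH
  LOAD_DETAIL_DUMP run_$i/detail_pop.100000
  FIND_GENOTYPE num_cpus
  RECALCULATE
  DETAIL run_$i/dominant.dat id fitness length sequence
END

# Loop over an explicit list of values instead of a range.
SET d run_1
FOREACH u 10000 50000 100000
  PURGE_BATCH
  LOAD_DETAIL_DUMP $d/detail_pop.$u
  STATUS
END
```

The PURGE_BATCH at the top of each loop body is there because the load commands add to the current batch rather than replacing its contents.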
Functions are currently very primitive, with fixed inputs of $0 through $9. $0 is always the function name, and then there can be up to 9 other arguments passed through. Once a function is created, it can be run just like any other command.
Table 8: Function-Related Commands
FUNCTION [name] This will create a function of the given name, including in it all of the commands up until an END is found. These commands will be bound to the function, but are not executed until the function is run as a command. Inside the function, the variables $1 through $9 can be used to access arguments passed in. |
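A function wrapping a common sequence of commands might look like the sketch below. The function name DOMINANT_STATS and all file names are hypothetical, and the "#" comment lines are an assumption about the file syntax.

```
# $1 is the detail file to load, $2 is the output file to write.
FUNCTION DOMINANT_STATS
  DEBUG Running $0 on $1
  PURGE_BATCH
  LOAD_DETAIL_DUMP $1
  FIND_GENOTYPE num_cpus
  RECALCULATE
  DETAIL $2 id fitness merit gest_time length
END

# Once defined, the function is run like any other command.
DOMINANT_STATS detail_pop.10000 dominant_10000.dat
DOMINANT_STATS detail_pop.100000 dominant_100000.dat
```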
Currently there are no conditionals or mathematical commands in this scripting language. These are both planned for the future.