================================

Parallel computing with BAPS 5

================================

- Jing Tang 22/11/2007

 

===============================

Q1: What does parallel mean here?

===============================

A1: Parallel computing aims to distribute the tasks of BAPS analysis into multiple computers such that the results will be produced faster. The idea is based on the fact that the BAPS analysis usually can be divided into smaller parallel tasks, which may be carried out simultaneously.

 

===============================================================

Q2: In what kind of a situation can I run parallel computing?

===============================================================

A2: Parallel computing is enabled for both the mixture and admixture analysis, but with different purposes.

 

A2.1 Mixture analysis

    For mixture analysis, parallel computing can make significant improvement on decreasing the computation time if the data is large and fuzzy. The fuzziness of a dataset is evaluated by the uncertainty of finding the optimal partition in limited runs. Of course one cannot know the fuzziness of the data prior to the analysis, but for large datasets it is always recommended to use multiple estimation runs in BAPS analysis by choosing different initial number of clusters, and to check whether the partition result from each run is stable or not.

    The idea of parallel computing is to distribute the task of multiple runs into multiple computers, such that on each computer only a subset of the possible initial cluster numbers are tested. The partition result identified by each 'local' computer will be compared with each other such that the 'global' optimal partition can be determined.

 

A2.2 Admixture analysis

    The parallel computing for admixture analysis is based on the algorithm in which the individuals of one cluster in a partition can be computed independently from other clusters. Thus each cluster can be made to run on a separate computer to achieve the results faster.

    Therefore, parallel computing for admixture is always recommended for large and complex data sets as long as there are multiple computers available.

 

A2.3 The case where parallel computing should NOT be used

    IF LINKAGE CLUSTERING model is used WHEN the data contains missing values, and the data HAS NOT been pre-processed into a .mat format, then parallel computing is not allowed. For such a case, missing values are estimated stochastically during the pre-processing step. The complete data used afterwards in the mixture analysis might not be identical for different runs. This data inconsistency will invalidate the ground for parallel computing since the partition results are no longer comparable. However, parallel computing can be used if the data with missing values has already been pre-processed. For the independent loci model the parallel computing is ALWAYS allowed.

 

============================================

Q3: How to carry out the parallel computing?

============================================

A3: Suppose we want to compare the partition results given by different initial number K of clusters. Such a task can be distributed by parallel computation into multiple computers, each of which handles only a subset of Ks. For instance we have a dataset called 'simple_data.txt' and we need to compare the partition results based on initial K = 2, 3, 4, 5. Given that two computers are available, the whole process will be implemented in the following:

1)    Write a script file that contains the parameters to run a subtask in the first computer. Here we give a script example indicating that mixture analysis should start with K=2 and 3 (Figure 1). The details of writing a script are described in section Q4.

     

               Figure 1: A script prepared for a mixture analysis on computer No. 1. It starts a mixture analysis with initial K=2 and 3.

 

2)    Open a command prompt window if using Windows, or open a shell window if using Linux/Unix. Go to the directory where the BAPS package is installed. Type 'baps5' followed by the COMPLETE name of the script file and press ENTER. NB! Change the string ‘baps5’ in the command according to the program version used. (Figure 2)

              Figure 2: Start  a sub-task with the script.  Replace 'E:\BAPS_release' with

              the installation directory of your own BAPS package. Also remember to put the full path of your script file in the behind.

 

3)    BAPS 5 will be evoked without displaying the GUI and produce the result as the script required. (Figure 3)

         Figure 3: The partition result. The result has been saved into the file as specified by the 'outputmat' command in the script.

4)    Repeat the steps above on the second computer to run the remaining subtasks. Here we show the second script to obtain the mixture results for initial K = 4 and 5 (Figure 4).

 

        Figure 4:  The script for computer No.2. It starts a mixture analysis with initial K=4 and 5.

   5) Collect all of the result files into one computer and start the GUI version of BAPS 5.0. Click on the 'Compare results' in the 'File' menu and load them into BAPS for comparison (Figure 5). Here we loaded the two result files: simple_result_for_computer_1.mat and simple_result_for_computer_2.mat. The comparing results will be shown in the back window.

                                                              Figure 5: The partition file selection window

 

 

 

 

==================================

Q4: How to write a script file?

==================================

A4.1: Syntax

    A script file consists of several lines of commands. Each command line writes in a format as

                      commandname('parameter')

    For example:

    datafile('c:\data\data1.txt')

    mixturetype('mix')

    initialk(10)

    ...

    NB!: Non-numerical parameters are always enclosed in single quotation marks.

 

A4.2: Universal commands

      - Universal commands are mandatory and should be used irrespective of mixture/admixture types.

  

datafile - Specify the full path of the data

          datafile('c:\data\data1.txt')

mixturetype - Specify the mixture/admixture model

          mixturetype('mix') - independent mixture model

          mixturetype('linear_mix') - linear linkage model

          mixturetype('codon_mix') - codon linkage model

          mixturetype('spatial') - spatial mixture model

          mixturetype('ad_mix') - admixture model

outputmat - Specify the result .mat file. The suffix of the result file must be .mat

      outputmat('c:\result\result1.mat')  

 

A4.3: Mandatory commands for mixture analysis

      - These commands should be used for any of the mixture models.

 

initialk - The initial number of clusters K

         initialk(10 11 12)

         initialk(10:12)

         initialk(10) - if fixedk('yes') then use single value.

fixedk - Whether K is fixed during the analysis

         fixedk('yes')

         fixedk('no')

datatype - Specify the data type

     datatype('numeric') - numeric baps data

     datatype('sequence') - sequence baps data, for linkage model only

     datatype('excel') - excel baps data, for linkage model only

     datatype('matlab') - preprocessed baps data

     datatype('genepop') - genepop data

     

A4.4: Optional commands for mixture analysis

namefile - Specify the population name file

          namefile('c:\data\data1_name.txt')

indexfile - Specify the population index file

          indexfile('c:\data\data1_index.txt')

groups - Indicate clustering of groups instead of individuals

      groups('yes')

      groups('no')

groupname - Specify the group name file

      groupname('c:\data\data1_groupname.txt')

coordinatefile - Specify the coordinate file, for spatial model only when the data type is other than 'matlab'.

linkagemap - Specify the linkage map file, for linkage model only when the data type is other than 'matlab' or 'excel'.

 

 

A4.5 Mandatory commands for admixture analysis

 clusters - Specify which of the clusters are analyzed

       clusters(1 3 5) - cluster 1, 3 and 5 will be analyzed.

 minsize - Specify the minimal size of population

       minsize(5) - minimal size of population is 5 individuals

 iters - Specify the number of iterations

       iters(50)

 refinds - Specify the number of reference individuals

       refinds(10)

 refiters - Specify the number of iterations for reference individuals

       refiters(20)

 

 

========================================================

Q5: Examples of script file

========================================================

Q5.1 Example 1 - BAPS data, independent mixture model

 

datafile('r:\baps5\data\simple_data.txt');

mixturetype('mix')

outputmat('r:\baps5\data\simple_mix_1.mat')

initialk(2 3)

fixedk('no')

datatype('numeric')

 

Q5.2 Example 2 - Preprocessed data, independent mixture model, fixed-k partition

datafile('e:\baps5\data\baps_data.mat');

mixturetype('mix')

outputmat('e:\baps5\data\baps_mix.mat')

initialk(10)

fixedk('yes')

datatype('matlab')

 

Q5.3 Example 3 - Excel data, codon linkage model

datafile('r:\baps5\data\simple_data.xls');

mixturetype('codon_mix')

outputmat('r:\baps5\data\simple_codon_mix.mat')

initialk(2 3)

fixedk('no')

datatype('excel')

 

Q5.4 Example 4 - BAPS data, spatial mixture model

datafile('r:\baps5\data\simple_data.txt');

mixturetype('spatial')

outputmat('r:\baps5\data\simple_spatial_mix.mat')

initialk(2 3)

fixedk('no')

datatype('numeric')

coordinatefile('r:\baps5\data\simple_cood.txt')

 

Q5.5 Example 5 - Admixture analysis on cluster 3 and 4

datafile('e:\baps5\data\ssuis_mixture_results_codon.mat')

mixturetype('ad_mix')

outputmat('e:\baps5\data\ssuis_small_script_2.mat')

clusters(3:4)

minsize(5)

iters(2)

refinds(3)

refiters(4)