 

9.0 Input File Formats and Conversion Program

 

This file documents the program convertf, which converts between the 5 file formats we support.  Note that "file format" simultaneously refers to the formats of three distinct files: the genotype file, the snp file and the indiv file.

 

Below, we document all 5 formats (ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED and PACKEDANCESTRYMAP) and explain how to use convertf to get from one format to another. Note that all the example files are in the directory:

 

ANCESTRYMAP Format:

 

The genotype file contains 1 line per valid genotype, and has 3 columns:

SNP_ID

Sample_ID

Number of Variant Alleles (0, 1 or 2)

Missing genotypes are encoded by the absence of an entry in the genotype file.
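The three-column layout and the "absent means missing" rule above can be sketched in Python. This is a minimal illustration, not part of convertf: it assumes whitespace-separated columns, and the SNP and sample IDs are made up.

```python
def read_ancestrymap_geno(lines):
    """Parse ANCESTRYMAP genotype lines into {(snp_id, sample_id): count}."""
    genotypes = {}
    for line in lines:
        fields = line.split()  # assumption: columns are whitespace-separated
        if not fields:
            continue  # skip blank lines
        snp_id, sample_id, count = fields[0], fields[1], int(fields[2])
        if count not in (0, 1, 2):
            raise ValueError(f"invalid allele count in line: {line!r}")
        genotypes[(snp_id, sample_id)] = count
    return genotypes


def lookup_genotype(genotypes, snp_id, sample_id):
    """Return the allele count, or None when no entry exists (missing)."""
    return genotypes.get((snp_id, sample_id))


# Hypothetical data: SAMPLE2 has no line for rs0002, so it is missing.
geno_demo = read_ancestrymap_geno([
    "rs0001 SAMPLE1 2",
    "rs0001 SAMPLE2 0",
    "rs0002 SAMPLE1 1",
])
print(lookup_genotype(geno_demo, "rs0001", "SAMPLE1"))
print(lookup_genotype(geno_demo, "rs0002", "SAMPLE2"))
```

Keying the dictionary on the (SNP, sample) pair makes the missing-data rule fall out naturally: a lookup that finds nothing is a missing genotype.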

The snp file contains 1 line per SNP.  There are 4 columns:

SNP_ID

Chromosome_Num

Genetic_Position

Physical_Position

Use 23 for the X chromosome. The genetic position can be in Morgans or centiMorgans, and the physical position is in bases.
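As a rough illustration of the four snp file columns, the sketch below parses one line into a named record. It assumes whitespace-separated columns and an integer physical position; the record type and the sample line are hypothetical.

```python
from typing import NamedTuple

X_CHROMOSOME = 23  # the X chromosome is coded as 23 in this snp file layout


class SnpRecord(NamedTuple):
    snp_id: str
    chromosome: int
    genetic_pos: float   # Morgans or centiMorgans, depending on the file
    physical_pos: int    # position in bases


def parse_snp_line(line):
    """Parse one snp file line: SNP_ID, chromosome, genetic and physical position."""
    snp_id, chrom, gen_pos, phys_pos = line.split()[:4]
    return SnpRecord(snp_id, int(chrom), float(gen_pos), int(phys_pos))


snp_demo = parse_snp_line("rs0001 23 0.012 1500000")  # made-up SNP
print(snp_demo.chromosome == X_CHROMOSOME)
```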

The indiv file contains 1 line per individual, and has 3 columns:

Sample_ID

Gender

Status

The gender column can be M (male), F (female) or U (unknown). The status column might refer to Case or Control status, or might be a population group label.  If this entry is set to "Ignore", then that individual and all genotype data from that individual will be removed from the data set in all convertf output.

The name "ANCESTRYMAP format" is used for historical reasons only.  This software is completely independent of our 2004 ANCESTRYMAP software.
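The indiv file layout and the "Ignore" rule can be sketched as follows. This is an illustration only: it assumes whitespace-separated columns and an exact, case-sensitive match on "Ignore", and the sample IDs are invented.

```python
def read_indiv(lines):
    """Return (sample_id, gender, status) tuples, dropping 'Ignore' rows."""
    kept = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sample_id, gender, status = fields[:3]
        if gender not in ("M", "F", "U"):
            raise ValueError(f"unexpected gender code in line: {line!r}")
        if status == "Ignore":  # assumption: exact, case-sensitive match
            continue
        kept.append((sample_id, gender, status))
    return kept


indiv_demo = read_indiv([
    "SAMPLE1 M Case",
    "SAMPLE2 F Control",
    "SAMPLE3 U Ignore",
])
print([sample_id for sample_id, _, _ in indiv_demo])
```

A full implementation would also drop every genotype belonging to the ignored individuals, as the text above describes.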

 

EIGENSTRAT Format: Used by EIGENSTRAT (both in the 07/23/06 release and in the current release).

genotype file: see example.eigenstratgeno

snp file:      see example.snp (same as above)

indiv file:    see example.ind (same as above)

 

The genotype file contains 1 line per SNP. Each line contains 1 character per individual:

  0 means zero copies of reference allele.

  1 means one copy of reference allele.

  2 means two copies of reference allele.

  9 means missing data.
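The one-character-per-individual encoding above can be decoded with a small lookup table. The sketch below is illustrative only; it represents missing data as None, which is a choice made here and not something convertf prescribes.

```python
# Map each EIGENSTRAT genotype character to a reference allele count.
EIGENSTRAT_CODES = {"0": 0, "1": 1, "2": 2, "9": None}  # None = missing data


def parse_eigenstrat_line(line):
    """Decode one genotype line (one SNP) into a list with one entry per individual."""
    counts = []
    for ch in line.strip():
        if ch not in EIGENSTRAT_CODES:
            raise ValueError(f"unexpected genotype character: {ch!r}")
        counts.append(EIGENSTRAT_CODES[ch])
    return counts


print(parse_eigenstrat_line("0129\n"))
```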

The program ind2pheno.perl in this directory converts example.ind to the example.pheno file needed by the EIGENSTRAT software. To run this script, type on the command line:

>> ./ind2pheno.perl example.ind example.pheno

 

PED Format:

For the indiv file, convertf also supports the full .ped file (example.ped) as input.

 

Note that the mandatory file name suffixes enable our software to recognize this file format.

The indiv file contains the first 7 columns of the genotype file (see below).

The genotype file is 1 line per individual.  Each line contains 7 columns of information about the individual, plus two genotype columns for each SNP in the order the SNPs are specified in the snp file. 

 The first 7 columns are:

 

convertf does not support pedigree information, so the 1st, 3rd and 4th columns are ignored in convertf input and set to arbitrary values in convertf output. In the two genotype columns for each SNP, missing data is represented by 0.
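The PED genotype line layout (7 leading columns, then two allele columns per SNP, with 0 for missing data) can be sketched as below. This is an illustration under assumptions: columns are whitespace-separated, alleles are written as letters, a genotype is treated as missing if either allele is 0, and the seven leading values are neutral placeholders.

```python
N_INFO_COLUMNS = 7  # leading per-individual columns described above


def parse_ped_line(line, n_snps):
    """Split a PED line into its 7 info columns and a list of genotypes.

    Each genotype is a (allele1, allele2) tuple, or None when missing.
    """
    fields = line.split()
    expected = N_INFO_COLUMNS + 2 * n_snps
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, found {len(fields)}")
    info = fields[:N_INFO_COLUMNS]
    alleles = fields[N_INFO_COLUMNS:]
    genotypes = []
    for i in range(0, len(alleles), 2):
        pair = (alleles[i], alleles[i + 1])
        # Assumption: a 0 in either allele column marks the genotype as missing.
        genotypes.append(None if "0" in pair else pair)
    return info, genotypes


# Placeholder info columns c1..c7, then two SNPs: A/G and a missing genotype.
ped_info, ped_genotypes = parse_ped_line("c1 c2 c3 c4 c5 c6 c7 A G 0 0", n_snps=2)
print(ped_genotypes)
```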

The snp file contains 1 line per SNP.  There are 4 columns:

Chromosome_Num

SNP_ID

Genetic_Position

Physical_Position

Use X for X chromosome. The genetic position is in Morgans, and the physical position is in bases.
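The PED snp file holds the same four fields as the ANCESTRYMAP snp file, but with the first two columns swapped and X in place of 23. A minimal sketch of that column mapping follows. It is not convertf's own code: it assumes whitespace-separated columns and a genetic position already in Morgans, and the SNP is made up.

```python
def ancestrymap_snp_to_ped_snp(line):
    """Reorder an ANCESTRYMAP snp line into PED snp column order."""
    snp_id, chrom, gen_pos, phys_pos = line.split()[:4]
    if chrom == "23":
        chrom = "X"  # PED snp files use X for the X chromosome
    # Assumption: gen_pos is already in Morgans, so no unit conversion is done.
    return " ".join([chrom, snp_id, gen_pos, phys_pos])


print(ancestrymap_snp_to_ped_snp("rs0001 23 0.012 1500000"))
```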

The indiv file contains the first 7 columns of the genotype file.

The PED format is used by the PLINK package of Shaun Purcell. See http://pngu.mgh.harvard.edu/~purcell/plink/.

 

PACKEDPED Format:

For the indiv file, convertf also supports a .ped file (example.ped) as input.

 

Note that the mandatory file name suffixes enable our software to recognize this file format.

example.bed is a packed binary file (2 bits per genotype).

The PACKEDPED format is used by the PLINK package of Shaun Purcell. See http://pngu.mgh.harvard.edu/~purcell/plink/.

For input in PACKEDPED format, the snp file MUST be in genomewide order.

For input in PACKEDPED format, the genotype file MUST be in SNP-major order (the PLINK default; see the PLINK documentation for details).

 

PACKEDANCESTRYMAP Format:

Note that example.packedancestrymapgeno is a packed binary file (2 bits per genotype).
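To make "2 bits per genotype" concrete, the sketch below packs four 2-bit codes into each byte and unpacks them again. It is a generic illustration, not the actual PACKEDPED or PACKEDANCESTRYMAP layout: the bit order (lowest bits first), the absence of any file header, and the meaning of each code are assumptions made here.

```python
def pack_2bit(codes):
    """Pack a sequence of 2-bit codes (0..3) into bytes, four codes per byte."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= (code & 0b11) << (2 * j)  # assumption: lowest bits first
        out.append(byte)
    return bytes(out)


def unpack_2bit(data, n_codes):
    """Unpack the first n_codes 2-bit codes from packed bytes."""
    codes = []
    for byte in data:
        for shift in (0, 2, 4, 6):
            codes.append((byte >> shift) & 0b11)
    return codes[:n_codes]  # drop padding in the final byte


packed_demo = pack_2bit([0, 1, 2, 3, 2])
print(len(packed_demo), unpack_2bit(packed_demo, 5))
```

Five genotypes fit in two bytes here, which is why packed files are roughly a quarter the size of a one-character-per-genotype text file.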

 

 

DOCUMENTATION OF convertf program:

To run this program, type on the command line:

>> /bin/convertf -p parfile

 

We illustrate how parfile works via a toy example (see example.perl in this directory):

par.ANCESTRYMAP.EIGENSTRAT        converts ANCESTRYMAP to EIGENSTRAT format

par.EIGENSTRAT.PED                converts EIGENSTRAT to PED format

par.PED.EIGENSTRAT                converts PED to EIGENSTRAT format

par.PED.PACKEDPED                 converts PED to PACKEDPED format

par.PACKEDPED.PACKEDANCESTRYMAP   converts PACKEDPED to PACKEDANCESTRYMAP

par.PACKEDANCESTRYMAP.ANCESTRYMAP converts PACKEDANCESTRYMAP to ANCESTRYMAP

 

Note that the choice of which allele is the reference allele may be arbitrary and thus converting to a new format and back again may change the choice of reference allele.
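The effect of a changed reference allele can be illustrated on an EIGENSTRAT genotype line. For a SNP with two alleles, a count of g copies of one allele corresponds to 2 - g copies of the other, while missing data stays missing. The sketch below is an illustration of that arithmetic only and does not describe how convertf chooses the reference allele.

```python
# Swapping the reference allele maps a count g to 2 - g; 9 (missing) is unchanged.
SWAP = {"0": "2", "1": "1", "2": "0", "9": "9"}


def flip_reference_allele(genotype_line):
    """Re-express one EIGENSTRAT genotype line relative to the other allele."""
    return "".join(SWAP[ch] for ch in genotype_line)


original_line = "0129"
flipped_line = flip_reference_allele(original_line)
print(flipped_line)
print(flip_reference_allele(flipped_line) == original_line)
```

Both lines describe the same underlying genotypes, so a round trip that produces flipped counts for some SNPs has not lost any data.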

 

DESCRIPTION OF EACH PARAMETER in parfile for convertf:

Parameter Name    Data type   Description                                                                        Possible and Default values

genotypename      String      input genotype file

snpname           String      input snp file

outputformat      String      output file format                                                                 One of: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP

genotypeoutname   String      output genotype file

snpoutname        String      output snp file

indivoutname      String      output indiv file

 

OPTIONAL PARAMETERS

familynames       String      Only relevant if input format is PED or PACKEDPED.

noxdata           Boolean     If set to YES, all SNPs on X chromosome are removed from the data set.

nomalexhet        Boolean     If set to YES, any het genotypes on X chr for males are changed to missing data.

badsnpname        String      Specifies a list of SNPs which should be removed from the data set.

outputgroup       Boolean     Only relevant if outputformat is PED or PACKEDPED.                                 Default: NO
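As a rough sketch of how a parfile could be generated from the parameters described above, the Python below renders one parameter per line and checks outputformat against the five supported values. The `name: value` line layout is an assumption; consult the par.* example files for the exact syntax. The output file names are hypothetical, and a working parfile may need parameters beyond the ones shown here.

```python
OUTPUT_FORMATS = {
    "ANCESTRYMAP", "EIGENSTRAT", "PED", "PACKEDPED", "PACKEDANCESTRYMAP",
}


def render_parfile(params):
    """Render a dict of convertf parameters as parfile text, one per line."""
    fmt = params.get("outputformat")
    if fmt not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported outputformat: {fmt!r}")
    # Assumption: each line has the form "name: value".
    return "".join(f"{name}: {value}\n" for name, value in params.items())


parfile_text = render_parfile({
    "genotypename": "example.eigenstratgeno",
    "snpname": "example.snp",
    "outputformat": "PED",
    "genotypeoutname": "out.ped",      # hypothetical output name
    "snpoutname": "out.pedsnp",        # hypothetical output name
    "indivoutname": "out.pedind",      # hypothetical output name
})
print(parfile_text, end="")
```

Validating outputformat before writing the file catches a misspelled format name before convertf is run.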

 

 
