9.0 Input File Formats and Conversion Program
This file documents the program convertf, which converts between the
5 different file formats we support.
Note that "file format" refers simultaneously to the formats of
three distinct files: the genotype file, the snp file and the indiv file.
Below, we document all 5 formats (ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED
and PACKEDANCESTRYMAP), and we explain how to use convertf
to get from one format to another. Note that all the example files are in the
directory:
ANCESTRYMAP Format:
The genotype file contains 1 line per valid genotype, and has 3 columns:
| SNP_ID | Sample_ID | Number of Variant Alleles (0, 1 or 2) |
Missing genotypes are encoded
by the absence of an entry in the genotype file.
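As an illustration of the "missing means absent" convention, the following is a minimal Python sketch, not part of convertf. It assumes whitespace-separated columns; the function names and the toy SNP and sample IDs are hypothetical.

```python
def read_ancestrymap_geno(lines):
    """Map (snp_id, sample_id) -> number of variant alleles (0, 1 or 2)."""
    genotypes = {}
    for line in lines:
        fields = line.split()  # assumption: columns are whitespace-separated
        if not fields:
            continue  # skip blank lines
        snp_id, sample_id, count = fields[0], fields[1], int(fields[2])
        if count not in (0, 1, 2):
            raise ValueError(f"unexpected allele count in line: {line!r}")
        genotypes[(snp_id, sample_id)] = count
    return genotypes


def lookup(genotypes, snp_id, sample_id):
    """Return the allele count, or None when no entry exists (missing genotype)."""
    return genotypes.get((snp_id, sample_id))


# Toy data: rs0002 has no line for SAMPLE2, so that genotype is missing.
toy = ["rs0001 SAMPLE1 2", "rs0001 SAMPLE2 0", "rs0002 SAMPLE1 1"]
table = read_ancestrymap_geno(toy)
missing = lookup(table, "rs0002", "SAMPLE2")
```

Here `missing` is `None` because the genotype file contains no line for that SNP and sample pair.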
The snp file contains 1 line per SNP. There are 4 columns:
| SNP_ID | Chromosome_Num | Genetic_Position | Physical_Position |
Use 23 for X chromosome. The genetic position can
be in Morgans or centiMorgans, and the physical position is in bases.
The indiv file contains 1 line per individual, and has 3 columns:
| Sample_ID | Gender | Status |
The gender column can be M(male), F(female) or
U (unknown). The status column might refer to Case or Control status, or might
be a population group label. If this
entry is set to "Ignore", then that individual and all genotype data
from that individual will be removed from the data set in all convertf output.
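The effect of the "Ignore" status can be sketched in Python as follows. This is only an illustration of the rule described above, not convertf's own code. It assumes whitespace-separated columns and an exact, case-sensitive match on "Ignore"; the function name and sample IDs are hypothetical.

```python
def read_indiv(lines):
    """Return (sample_id, gender, status) tuples, dropping 'Ignore' individuals."""
    kept = []
    for line in lines:
        fields = line.split()  # assumption: columns are whitespace-separated
        if not fields:
            continue  # skip blank lines
        sample_id, gender, status = fields[0], fields[1], fields[2]
        if gender not in ("M", "F", "U"):
            raise ValueError(f"unexpected gender code in line: {line!r}")
        if status == "Ignore":  # assumption: the match is case-sensitive
            continue  # this individual is removed from the data set
        kept.append((sample_id, gender, status))
    return kept


toy = ["SAMPLE1 M Case", "SAMPLE2 F Ignore", "SAMPLE3 U Control"]
individuals = read_indiv(toy)
kept_ids = [sample_id for sample_id, _, _ in individuals]
```

Any genotype entries belonging to a dropped Sample_ID would then also have to be discarded to reproduce the behaviour described above.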
The name "ANCESTRYMAP format" is used for historical reasons
only. This software is completely
independent of our 2004 ANCESTRYMAP software.
EIGENSTRAT Format: Used by EIGENSTRAT (both in the
- genotype file: see example.eigenstratgeno
- snp file: see example.snp (same as above)
- indiv file: see example.ind (same as above)
The genotype file
contains 1 line per SNP. Each line contains 1 character per individual:
0 means zero copies of reference allele.
1 means one copy of reference allele.
2 means two copies of reference allele.
9 means missing data.
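A minimal Python sketch of decoding one line of an EIGENSTRAT genotype file follows. It only illustrates the encoding listed above; the function name is hypothetical, and missing data is represented here as `None` by choice.

```python
def decode_eigenstrat_line(line):
    """Decode one SNP line: one character per individual.

    Returns a list with the number of reference allele copies (0, 1 or 2)
    for each individual, or None where the genotype is missing (code 9).
    """
    codes = {"0": 0, "1": 1, "2": 2, "9": None}
    decoded = []
    for ch in line.strip():  # drop the trailing newline, if any
        if ch not in codes:
            raise ValueError(f"unexpected genotype character: {ch!r}")
        decoded.append(codes[ch])
    return decoded


# Four individuals at one SNP; the last one has missing data.
row = decode_eigenstrat_line("0129\n")
```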
The program ind2pheno.perl in this directory will convert from example.ind
to the example.pheno file needed by the EIGENSTRAT software. To run this
script, type on the command line:
>> ./ind2pheno.perl example.ind example.pheno
PED Format:
convertf also supports the full .ped file (example.ped) for this input file.
Note that mandatory suffix names enable our software to recognize this file format.
The indiv file contains the first 7 columns of the genotype file (see below).
The genotype file is 1
line per individual. Each line contains
7 columns of information about the individual, plus two genotype columns for
each SNP in the order the SNPs are specified in the snp file.
The first 7 columns are:
convertf does not support pedigree information, so the 1st, 3rd and 4th
columns are ignored in convertf input and set to arbitrary values in convertf
output. In the two genotype columns for each SNP, missing data is represented by 0.
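The line layout described above (7 leading columns, then two genotype columns per SNP) can be sketched in Python as follows. This is an illustrative parser, not convertf's implementation. It assumes whitespace-separated fields and treats a SNP as missing if either of its two columns is 0; the function name and the toy line are hypothetical.

```python
def parse_ped_line(line, n_snps):
    """Split one PED genotype line into its 7 leading columns and per-SNP pairs.

    Returns (header, pairs), where pairs[i] is a tuple of the two genotype
    columns for the i-th SNP in snp-file order, or None if it is missing.
    """
    fields = line.split()  # assumption: fields are whitespace-separated
    expected = 7 + 2 * n_snps
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, found {len(fields)}")
    header = fields[:7]
    alleles = fields[7:]
    pairs = []
    for i in range(0, len(alleles), 2):
        first, second = alleles[i], alleles[i + 1]
        # assumption: a 0 in either column marks the whole genotype as missing
        pairs.append(None if "0" in (first, second) else (first, second))
    return header, pairs


# Hypothetical individual with 2 SNPs; the second genotype is missing.
header, pairs = parse_ped_line("FAM1 SAMPLE1 0 0 1 2 1 A G 0 0", n_snps=2)
```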
The snp file contains 1 line per SNP. There are 4 columns:
| Chromosome_Num | SNP_ID | Genetic_Position | Physical_Position |
Use X for X chromosome.
The genetic position is in Morgans, and the physical position is in bases.
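The differences between this snp file and the ANCESTRYMAP snp file (column order, and X versus 23 for the X chromosome) can be sketched in Python as follows. This is an illustration only, not what convertf does internally. It assumes whitespace-separated input, tab-separated output, and a genetic position already in Morgans (no unit conversion is attempted); the function name and SNP ID are hypothetical.

```python
def ancestrymap_snp_to_ped_snp(line):
    """Reorder one ANCESTRYMAP snp line into the PED snp column order."""
    # ANCESTRYMAP order: SNP_ID, Chromosome_Num, Genetic_Position, Physical_Position
    snp_id, chrom, genetic, physical = line.split()[:4]
    if chrom == "23":
        chrom = "X"  # ANCESTRYMAP uses 23 for the X chromosome, PED uses X
    # PED order: Chromosome_Num, SNP_ID, Genetic_Position, Physical_Position
    return "\t".join([chrom, snp_id, genetic, physical])


converted = ancestrymap_snp_to_ped_snp("rs0001 23 0.01 12345")
```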
The PED format is used by
the PLINK package of Shaun Purcell. See
http://pngu.mgh.harvard.edu/~purcell/plink/.
PACKEDPED Format:
convertf also supports a .ped file (example.ped) for this input file.
Note that mandatory suffix names enable our software to recognize this file format.
example.bed is a packed
binary file (2 bits per genotype).
The PACKEDPED format is
used by the PLINK package of Shaun Purcell. See
http://pngu.mgh.harvard.edu/~purcell/plink/.
For input in PACKEDPED format, the snp file MUST be in genome-wide order.
For input in PACKEDPED format, the genotype file MUST be in SNP-major order
(the PLINK default; see the PLINK documentation for details).
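To make "2 bits per genotype" concrete, here is a minimal Python sketch that unpacks four 2-bit codes from each byte. It assumes the lowest-order bits of a byte hold the first genotype, which matches the layout described in the PLINK documentation for .bed files. It does not handle any file header, and the meaning of each 2-bit code should be taken from the PLINK documentation. The function name is hypothetical.

```python
def unpack_genotype_codes(packed, n_genotypes):
    """Unpack 2-bit genotype codes from a bytes object, 4 codes per byte.

    Assumption: within each byte, bits 0-1 hold the first code, bits 2-3 the
    second, and so on. Trailing padding codes beyond n_genotypes are dropped.
    """
    codes = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            codes.append((byte >> shift) & 0b11)
    return codes[:n_genotypes]


# Two bytes holding 5 genotype codes; the last 3 slots are padding.
codes = unpack_genotype_codes(bytes([0b11100100, 0b00000001]), 5)
```

In SNP-major order, each SNP's block of bytes would be unpacked this way with `n_genotypes` set to the number of individuals.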
PACKEDANCESTRYMAP Format:
Note that example.packedancestrymapgeno is a packed binary file (2 bits per genotype).
DOCUMENTATION OF convertf program:
To run this program, type on the command line:
>> /bin/convertf -p parfile
We illustrate how parfile works via a toy example (see
example.perl in this directory):
par.ANCESTRYMAP.EIGENSTRAT converts ANCESTRYMAP to EIGENSTRAT format
par.EIGENSTRAT.PED converts EIGENSTRAT to PED format
par.PED.EIGENSTRAT converts PED to EIGENSTRAT format
par.PED.PACKEDPED converts PED to PACKEDPED format
par.PACKEDPED.PACKEDANCESTRYMAP converts PACKEDPED to PACKEDANCESTRYMAP format
par.PACKEDANCESTRYMAP.ANCESTRYMAP converts PACKEDANCESTRYMAP to ANCESTRYMAP format
Note that the choice of
which allele is the reference allele may be arbitrary and thus converting to a
new format and back again may change the choice of reference allele.
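The effect of a changed reference allele can be sketched in Python using the EIGENSTRAT encoding (0, 1, 2 copies of the reference allele, 9 for missing). This is an illustration of the consequence described above, not convertf code; the function name is hypothetical.

```python
def swap_reference_allele(genotype_line):
    """Re-express one EIGENSTRAT SNP line relative to the other allele.

    Two copies of the old reference allele become zero copies of the new one
    and vice versa; heterozygotes (1) and missing data (9) are unchanged.
    """
    swap = {"0": "2", "1": "1", "2": "0", "9": "9"}
    return "".join(swap[ch] for ch in genotype_line)


original = "0129"
flipped = swap_reference_allele(original)
restored = swap_reference_allele(flipped)
```

Two genotype lines that differ only by such a swap describe the same data, so a round trip through another format should be compared with this in mind.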
DESCRIPTION OF EACH PARAMETER in parfile for convertf:
| Parameter Name | Data type | Description | Possible and Default values |
| genotypename | String | input genotype file | |
| snpname | String | input snp file | |
| outputformat | String | Can be one of the following: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP | |
| genotypeoutname | String | output genotype file | |
| snpoutname | String | output snp file | |
| indivoutname | String | output indiv file | |
| OPTIONAL PARAMETERS | | | |
| familynames | String | Only relevant if input format is PED or PACKEDPED. | |
| noxdata | Boolean | If set to YES, all SNPs on X chromosome are removed from the data set. | |
| nomalexhet | Boolean | If set to YES, any het genotypes on X chr for males are changed to missing data | |
| badsnpname | String | Specifies a list of SNPs which should be removed from the data set | |
| outputgroup | Boolean | Only relevant if outputformat is PED or PACKEDPED | NO |
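As a rough illustration of assembling a parfile from the parameters above, here is a minimal Python sketch. It assumes a `name: value` line syntax, which this document does not spell out, so please check it against the par.* example files. The function name and the output file names (out.geno, out.snp, out.ind) are hypothetical, and the par.* example files remain the authority for the complete set of required parameters.

```python
def render_parfile(params):
    """Render a dict of parameter names and values as parfile text.

    Assumption: each parameter occupies one line of the form 'name: value'.
    """
    lines = [f"{name}: {value}" for name, value in params.items()]
    return "\n".join(lines) + "\n"


# Parameter names are taken from the table above; output names are hypothetical.
params = {
    "genotypename": "example.eigenstratgeno",
    "snpname": "example.snp",
    "outputformat": "ANCESTRYMAP",
    "genotypeoutname": "out.geno",
    "snpoutname": "out.snp",
    "indivoutname": "out.ind",
    "noxdata": "YES",  # optional: remove all SNPs on the X chromosome
}
text = render_parfile(params)
```

The resulting text could be written to a file and passed to convertf with the -p option shown earlier.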