Question: Is a Mac or Windows version of EIGENSOFT available?
Answer: No. Due to our limited resources we are only able to support
Linux at this time.
Question: Is documentation available for EIGENSOFT?
Answer: Yes. See the main README file included in the software release and
the additional documentation files referenced therein.
Question: What's inside the 3 directories CONVERTF, POPGEN, and EIGENSTRAT?
Answer: The CONVERTF directory contains documentation and examples of our
convertf program for converting file formats. The POPGEN directory contains
documentation and examples of our smartpca program for running PCA.
The EIGENSTRAT directory contains documentation and examples of correcting for
population stratification in disease studies using the EIGENSTRAT approach,
as well as a Perl wrapper, smartpca.perl, for running the smartpca program.
Question: I tried running EIGENSOFT but the code crashes. What should I do?
Answer: This is probably a systems issue. Try running the pcatoy program;
if this trivial program also crashes, contact your system administrator for help
in tracking down the problem. See documentation for details.
Question: Can I run EIGENSOFT on very large data sets?
Answer: Yes. We currently support GWAS data sets up to 8 billion genotypes.
For data sets between 2 billion and 8 billion genotypes, some care is
required. See documentation for details.
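
As a rough pre-check, the size limits above can be written down in a few
lines of Python. This is only a sketch: it assumes that the number of
genotypes is the number of samples times the number of SNPs and that the
2 billion and 8 billion boundaries are inclusive. The name size_category is
hypothetical and is not part of EIGENSOFT.

```python
# Thresholds quoted in the answer above.
CARE_THRESHOLD = 2_000_000_000
MAX_SUPPORTED = 8_000_000_000


def size_category(n_samples: int, n_snps: int) -> str:
    """Classify a data set by its total genotype count.

    Assumption: total genotypes = samples x SNPs.
    """
    total = n_samples * n_snps
    if total > MAX_SUPPORTED:
        return "unsupported"
    if total >= CARE_THRESHOLD:
        return "care required"
    return "standard"
```

For example, 5,000 samples typed at 1,000,000 SNPs give 5 billion genotypes,
which falls in the range where some care is required.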
Question: I am running with outlier removal on a large data set and the
number of outliers removed seems too large, not only in the first iteration but
in subsequent iterations as well. What should I do?
Answer: The outlier removal approach we have implemented is a heuristic that
seems to work well on data sets up to 1000 samples. For larger data sets,
we recommend either increasing the #sdev threshold (-s flag in smartpca.perl,
outliersigmathresh parameter in smartpca) above 6.0, or combining your data
set with HapMap data and just removing samples with unusual continental
ancestry along the top two axes. [If choosing the latter, you can reduce
running time by computing PCs using HapMap populations only. See documentation
for details on how to do this.]
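
To see why raising the threshold removes fewer samples, here is a simplified
Python illustration of a standard-deviation cutoff along a single axis. It is
not the smartpca implementation: the function name flag_outliers, the use of
a single pass, and the use of the population standard deviation are all
assumptions made for this sketch.

```python
import statistics


def flag_outliers(scores, sigma_thresh=6.0):
    """Return indices of samples whose score along one axis lies more than
    sigma_thresh standard deviations from the mean (illustrative only)."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return []  # all scores identical, nothing to flag
    return [i for i, x in enumerate(scores)
            if abs(x - mean) > sigma_thresh * sd]
```

With 99 samples scoring 0.0 and one scoring 100.0, the extreme sample is
flagged at a threshold of 6.0 but not at 10.0.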
Question: Can I use EIGENSTRAT in studies of quantitative traits?
Answer: Yes. See README file in EIGENSTRAT directory.
Question: Can I use EIGENSTRAT in studies involving imputed SNPs?
Answer: At the moment, our code does not support probabilistic genotypes that
may be produced by imputation programs. This is algorithmically
straightforward, but due to our limited resources it may
be a while before we can provide this upgrade. In the meantime, a possible
solution is to first run PCA on non-imputed SNPs (this will indicate whether
there are ancestry differences between cases and controls) and then run
EIGENSTRAT to compute disease association statistics for all SNPs by sampling
integer-valued genotypes in the case of imputed SNPs.
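
The sampling step in this workaround could look like the Python sketch below.
It assumes that the imputation program reports, for each sample at a SNP,
three probabilities for genotypes 0, 1 and 2; that input layout and the
function names sample_genotype and sample_snp are hypothetical, and writing
the sampled genotypes to an EIGENSOFT input file is not shown.

```python
import random


def sample_genotype(probs, rng):
    """Draw one integer genotype (0, 1 or 2) from its three probabilities."""
    return rng.choices((0, 1, 2), weights=probs, k=1)[0]


def sample_snp(prob_rows, seed=0):
    """Sample one integer genotype per sample for a single imputed SNP.

    prob_rows holds one (P(0), P(1), P(2)) triple per sample.
    A fixed seed makes the sampling reproducible.
    """
    rng = random.Random(seed)
    return [sample_genotype(p, rng) for p in prob_rows]
```

A triple such as (0.1, 0.7, 0.2) yields genotype 1 most of the time, while a
confidently imputed triple such as (0, 0, 1) always yields genotype 2.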
Question: How long does the code take to run?
Answer: See README file in EIGENSTRAT directory.
Question: The code takes a long time to run on my huge data set. Isn't a
fast eigenvector approximation possible?
Answer: Yes, in theory it is possible to greatly reduce computation time
of top eigenvectors using a fast eigenvector approximation. Unfortunately,
due to our limited resources, we have yet to implement this.
Question: I'm running on an extremely large number of samples and the software
runs out of memory. Why?
Answer: The software uses memory proportional to the square of the number of
samples. In the case of an extremely large number of samples (e.g. >10,000),
the software may run out of memory. The fast eigenvector approximation
described above would actually solve this problem, but is not yet implemented.
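
The quadratic growth can be made concrete with a back-of-the-envelope
estimate in Python. The sketch assumes a single samples-by-samples matrix
stored with 8 bytes per entry; the actual memory use of the software may be
higher, and the name pairwise_matrix_bytes is hypothetical.

```python
def pairwise_matrix_bytes(n_samples: int, bytes_per_entry: int = 8) -> int:
    """Bytes needed for one n_samples x n_samples matrix.

    Assumption: 8 bytes per entry (double precision), one matrix only.
    """
    return n_samples * n_samples * bytes_per_entry
```

Under these assumptions, 10,000 samples need about 0.8 GB for one such
matrix and 20,000 samples need about 3.2 GB: doubling the number of samples
quadruples the memory.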
Question: When I run I get an error message about "idnames too long".
What should I do?
Answer: The software supports sample ID names up to a maximum of 39 characters.
Longer sample ID names must be shortened. In addition, if your data is in
PED format, the default is to concatenate the family ID and sample ID names
so that their total length must meet this limit; however, you can set
"familynames: NO" so that only the sample ID name will be used and
must meet the 39 character limit.
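
Before running, a PED file can be screened for over-long names with a short
Python sketch like the one below. It assumes whitespace-separated PED lines
with the family ID in the first column and the sample ID in the second. The
function name too_long_ids and the joiner_len parameter (extra characters
added when the two names are concatenated, 0 by default) are hypothetical.

```python
MAX_ID_LEN = 39  # limit stated in the answer above


def too_long_ids(ped_lines, familynames=True, joiner_len=0):
    """Return (family ID, sample ID, length) for every name over the limit.

    familynames=True mimics the default, where family ID and sample ID
    are concatenated; familynames=False checks the sample ID alone.
    """
    offenders = []
    for line in ped_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        fam, ind = fields[0], fields[1]
        length = len(ind)
        if familynames:
            length += len(fam) + joiner_len
        if length > MAX_ID_LEN:
            offenders.append((fam, ind, length))
    return offenders
```

Any line reported with familynames=True but not with familynames=False is a
candidate for the "familynames: NO" setting rather than for renaming.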
Question: Is it possible to obtain the SNP weights for each SNP
along each top eigenvector?
Answer: Yes. See snpweightoutname parameter documented in POPGEN/README.