EIGENSOFT: Frequently Asked Questions

Question: Is a Mac or Windows version of EIGENSOFT available?
Answer: No. Due to our limited resources we are only able to support Linux at this time.

Question: Is documentation available for EIGENSOFT?
Answer: Yes. See the main README file included in the software release, and the additional documentation files it references.

Question: What's inside the 3 directories CONVERTF, POPGEN, and EIGENSTRAT?
Answer: The CONVERTF directory contains documentation and examples of our convertf program for converting file formats. The POPGEN directory contains documentation and examples of our smartpca program for running PCA. The EIGENSTRAT directory contains documentation and examples of correcting for population stratification in disease studies using the EIGENSTRAT approach, as well as a Perl wrapper, smartpca.perl, for running the smartpca program.

Question: I tried running EIGENSOFT but the code crashes. What should I do?
Answer: This is probably a system issue. Try running the pcatoy program; if this trivial program also crashes, contact your system administrator for help in tracking down the problem. See the documentation for details.

Question: Can I run EIGENSOFT on very large data sets?
Answer: Yes. We currently support GWAS data sets up to 8 billion genotypes. For data sets between 2 billion and 8 billion genotypes, some care is required. See documentation for details.
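
As a rough illustration of the size ranges above, the following sketch assumes that the number of genotypes is the number of samples multiplied by the number of SNPs. The function name and category labels are hypothetical and are not part of EIGENSOFT.

```python
# Hypothetical helper, not part of EIGENSOFT.
# Assumption: "genotypes" means (number of samples) x (number of SNPs).
TWO_BILLION = 2_000_000_000
EIGHT_BILLION = 8_000_000_000


def size_category(num_samples, num_snps):
    """Classify a data set against the size ranges quoted in the answer above."""
    genotypes = num_samples * num_snps
    if genotypes > EIGHT_BILLION:
        return "unsupported"
    if genotypes >= TWO_BILLION:
        return "supported with care"
    return "supported"


# Example: 5,000 samples typed at 1,000,000 SNPs is 5 billion genotypes.
print(size_category(5_000, 1_000_000))
```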

Question: I am running with outlier removal on a large data set and the number of outliers removed seems too large, not only in the first iteration but in subsequent iterations as well. What should I do?
Answer: The outlier removal approach we have implemented is a heuristic that seems to work well on data sets of up to 1000 samples. For larger data sets, we recommend either increasing the #sdev threshold (-s flag in smartpca.perl, outliersigmathresh parameter in smartpca) above 6.0, or combining your data set with HapMap data and removing only samples with unusual continental ancestry along the top two axes. [If you choose the latter, you can reduce running time by computing PCs using HapMap populations only. See documentation for details on how to do this.]

Question: Can I use EIGENSTRAT in studies of quantitative traits?
Answer: Yes. See the README file in the EIGENSTRAT directory.

Question: Can I use EIGENSTRAT in studies involving imputed SNPs?
Answer: At the moment, our code does not support the probabilistic genotypes that imputation programs may produce. Adding this support is algorithmically straightforward, but due to our limited resources it may be a while before we can provide this upgrade. In the meantime, a possible solution is to first run PCA on non-imputed SNPs (this will indicate whether there are ancestry differences between cases and controls) and then run EIGENSTRAT to compute disease association statistics for all SNPs, sampling integer-valued genotypes for the imputed SNPs.
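
The sampling step mentioned above might look like the following minimal sketch. It assumes that the imputation program reports, for each sample and SNP, three probabilities for genotypes 0, 1, and 2; that input layout and the function name are assumptions for illustration and are not part of EIGENSOFT.

```python
import random


def sample_genotype(probs, rng=random):
    """Draw one integer genotype (0, 1, or 2) according to its probabilities.

    Assumption: probs is a sequence (p0, p1, p2) that sums to 1.
    """
    if len(probs) != 3 or abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("expected three probabilities that sum to 1")
    return rng.choices((0, 1, 2), weights=probs, k=1)[0]


# Made-up probabilities for three samples at one imputed SNP.
rng = random.Random(0)  # fixed seed so that the draw is reproducible
imputed = [(0.9, 0.1, 0.0), (0.0, 0.0, 1.0), (0.2, 0.5, 0.3)]
sampled = [sample_genotype(p, rng) for p in imputed]
print(sampled)
```

The sampled integer genotypes could then be written to the genotype file in place of the probabilistic calls.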

Question: How long does the code take to run?
Answer: See the README file in the EIGENSTRAT directory.

Question: The code takes a long time to run on my huge data set. Isn't a fast eigenvector approximation possible?
Answer: Yes, in theory it is possible to greatly reduce computation time of top eigenvectors using a fast eigenvector approximation. Unfortunately, due to our limited resources, we have yet to implement this.

Question: I'm running on an extremely large number of samples and the software runs out of memory. Why?
Answer: The software uses memory proportional to the square of the number of samples. In the case of an extremely large number of samples (e.g. >10,000), the software may run out of memory. The fast eigenvector approximation described above would actually solve this problem, but is not yet implemented.
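
To see how quickly quadratic memory use grows, the sketch below estimates the size of a single samples-by-samples matrix. It assumes 8 bytes per entry (double precision) and only one such matrix; both are assumptions for illustration, and the software's actual memory use may be higher.

```python
def matrix_memory_gib(num_samples, bytes_per_entry=8):
    """Approximate size in GiB of one num_samples x num_samples matrix.

    Assumption: 8 bytes per entry and a single matrix held in memory.
    """
    return num_samples ** 2 * bytes_per_entry / 2 ** 30


for n in (1_000, 10_000, 50_000):
    print(f"{n:>6} samples: about {matrix_memory_gib(n):.2f} GiB")
```

Doubling the number of samples quadruples the estimate.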

Question: When I run I get an error message about "idnames too long". What should I do?
Answer: The software supports sample ID names up to a maximum of 39 characters. Longer sample ID names must be shortened. In addition, if your data is in PED format, the default is to concatenate the family ID and sample ID names, so their combined length must meet this limit; however, you can set "familynames: NO" so that only the sample ID name is used, and only it must meet the 39-character limit.
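
A quick way to find offending names before running the software is sketched below. The helper names are hypothetical, and the ":" separator between family ID and sample ID is an assumption about how the names are joined; adjust it if your version differs.

```python
MAX_ID_LENGTH = 39  # limit stated in the answer above


def effective_id(family_id, sample_id, familynames=True, separator=":"):
    """Return the ID as it would be checked against the length limit.

    Assumption: with familynames enabled, the two names are joined with
    a separator character.
    """
    if familynames:
        return f"{family_id}{separator}{sample_id}"
    return sample_id


def too_long_ids(ped_rows, familynames=True, separator=":"):
    """List effective IDs longer than MAX_ID_LENGTH.

    ped_rows: iterable of (family_id, sample_id) pairs, i.e. the first
    two columns of a PED file.
    """
    ids = (effective_id(f, s, familynames, separator) for f, s in ped_rows)
    return [i for i in ids if len(i) > MAX_ID_LENGTH]


# Made-up rows: the second one exceeds the limit only when family IDs are included.
rows = [("FAM001", "NA12878"), ("F" * 25, "S" * 20)]
print(too_long_ids(rows))                     # family ID + sample ID
print(too_long_ids(rows, familynames=False))  # sample ID only
```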

Question: Is it possible to obtain the SNP weights for each SNP along each top eigenvector?
Answer: Yes. See the snpweightoutname parameter documented in POPGEN/README.