This is the main validation set as used in the second CADD publication. The two variant sets are based on
A) SNVs designated as pathogenic (CLNSIG=Pathogenic;) in ClinVar (version 20180729 for GRCh37)
B) SNVs from ExAC (release1) with mean allele frequency greater than 0.05

Both sets were lifted between GRCh37 and GRCh38. Variants that did not map to a consensus chromosome sequence (chromosomes 1-22, X) in both builds were excluded, and the two variants overlapping between the two sets were removed. These sets (labelled 'all') were used in Fig. 2A and Suppl. Fig. S1B.

Fig. 2B (set labelled 'missense')
We then annotated all variants using Ensembl VEP (GRCh37, db version 92) and selected all variants annotated as missense. We counted the number of variants per annotated gene (genes that are annotated to the same variants are grouped together) in each of the two sets and then selected, for each gene, the smaller of the two counts as the number of variants taken from each set.

Suppl. Fig. S1A
All variants of the 'all' set that are found in the FunSeq2 and LINSIGHT whole-genome files (which are supposedly only defined for non-coding variants).

CADD GRCh38-v1.4 is always scored by lifting variants to GRCh38 and then lifting scores back to GRCh37.

For further validation sets, as shown in the release notes document, please see the validation sets for v1.3.
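
The per-gene matching described for the 'missense' set can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the publication: the function names, the variant IDs and the gene labels are hypothetical, each variant is assumed to already carry a single (grouped) gene label, and random sampling is assumed as the way of picking which variants are kept, since the description above does not say how they were chosen.

```python
import random
from collections import defaultdict


def group_by_gene(variants):
    """Map each gene label to the list of variant IDs annotated to it."""
    by_gene = defaultdict(list)
    for variant_id, gene in variants:
        by_gene[gene].append(variant_id)
    return by_gene


def match_per_gene(pathogenic, common, seed=0):
    """Keep the same number of variants per gene in both sets.

    For each gene the smaller of the two per-set counts is used.
    Genes present in only one set have a minimum of zero and drop out.
    ASSUMPTION: variants are picked by random sampling.
    """
    rng = random.Random(seed)
    path_by_gene = group_by_gene(pathogenic)
    comm_by_gene = group_by_gene(common)
    matched_path, matched_comm = [], []
    for gene in sorted(path_by_gene.keys() & comm_by_gene.keys()):
        n = min(len(path_by_gene[gene]), len(comm_by_gene[gene]))
        matched_path += rng.sample(path_by_gene[gene], n)
        matched_comm += rng.sample(comm_by_gene[gene], n)
    return matched_path, matched_comm


# Hypothetical toy input: (variant ID, grouped gene label) pairs.
pathogenic = [("p1", "GENE_A"), ("p2", "GENE_A"), ("p3", "GENE_A"), ("p4", "GENE_B")]
common = [("c1", "GENE_A"), ("c2", "GENE_B"), ("c3", "GENE_B"), ("c4", "GENE_C")]

matched_path, matched_comm = match_per_gene(pathogenic, common)
print(len(matched_path), len(matched_comm))  # 2 2
```

In the toy input, GENE_A contributes one variant per set (minimum of 3 and 1), GENE_B contributes one (minimum of 1 and 2), and GENE_C is dropped because it occurs in only one set.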