Obesity: The Role of Genes

Introduction

Obesity is a medical condition in which a person carries excess weight or body fat that may affect their health. Whether its main cause is environmental or lies in a person’s genetic composition is still debated.
We chose to study obesity because its prevalence has risen rapidly over the years, especially in the United States.

Figure 1: Graph representing the prevalence of obesity in the United States
Source: https://www.usnews.com/news/data-mine/articles/2018-03-26/sharp-increase-in-obesity-rates-over-last-decade-federal-data-show

The uncertainty of how much obesity is influenced by genetic factors led us to these questions:

Is obesity correlated with the genetic makeup of an individual?
If so, are there specific genes that are correlated with the trait of obesity?
If so, are the SNPs in those genes significant?

We first identified the genes most strongly associated with obesity, then analyzed the SNPs located on those genes. Our assumption is that if a gene is highly associated with obesity, the SNPs located on that gene should also be significantly associated with the disease. We extended the analysis with the odds ratio (OR) values of the SNPs under investigation. To test the assumption, we used two different but complementary databases, integrated their data, and visualized the result. The integrated data table also allows us to answer the questions above.

Methods

Because we needed databases that report gene-disease and SNP-disease associations, we integrated the GWAS Catalog (https://www.ebi.ac.uk/gwas/home) and DisGeNET (http://www.disgenet.org/home/) databases.

Database 1 - DisGeNET

The DisGeNET database integrates information on human gene-disease associations (GDAs) and variant-disease associations (VDAs) from various repositories, covering Mendelian, complex, and environmental diseases. We looked up the trait obesity on DisGeNET under the Gene Disease Summary category and downloaded the resulting .tsv file.

Database 2 - GWAS Catalog

The GWAS Catalog is a database of published genome-wide association studies that reports the variants associated with a given gene, trait, or disease. We looked up the IL6 gene in the GWAS Catalog browser and downloaded the data table of the gene as a .tsv file.

Data Integration

Packages that would be useful for data cleaning and integration were first loaded into RStudio.
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)

The two .tsv files were then loaded.

dis <- read_tsv(file='obes_gda.tsv')
gwas <- read_tsv(file='obes_gwas.tsv')

After inspecting the columns of both tables, we selected only the columns needed for integration.

glimpse(dis)
dis_selected <- select(dis, Gene, Gene_id, UniProt, Gene_Full_Name, Protein_Class, N_diseases_g,
                       DSI_g, DPI_g, pLI, Score_gda, N_SNPs_gda)
glimpse(gwas)
gwas_selected <- select(gwas, REGION, CHR_ID, CHR_POS, `REPORTED GENE(S)`, `STRONGEST SNP-RISK ALLELE`,
                        SNPS, CONTEXT, `RISK ALLELE FREQUENCY`, `P-VALUE`, PVALUE_MLOG, `OR or BETA`, 
                        `95% CI (TEXT)`)

To make sure every row was unique before joining the two datasets, the unique() function was used.
gwas_unique <- unique(gwas_selected)
View(gwas_unique)
dis_unique <- unique(dis_selected)
View(dis_unique)
The number of observations remained the same after the code above was run.
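
One way to confirm that no duplicates were present is to count how many rows deduplication removes. The sketch below uses a small toy table with hypothetical values (not the real DisGeNET data) and dplyr's distinct(), which for data frames has the same effect as unique().

```r
library(dplyr)

# Toy table with one exact duplicate row (hypothetical values, not real data)
toy <- tibble(
  Gene      = c("MC4R", "MC4R", "FTO"),
  Score_gda = c(0.9, 0.9, 0.3)
)

n_before   <- nrow(toy)
toy_unique <- distinct(toy)               # drops exact duplicate rows
n_removed  <- n_before - nrow(toy_unique)

n_removed   # 0 would mean the table had no duplicates to begin with
```

Applied to dis_selected and gwas_selected, a count of zero corresponds to the observation that the row counts did not change.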

The two tables were then joined by using the common column containing gene names.
joined <- left_join(dis_unique, gwas_unique, by = c('Gene' = 'REPORTED GENE(S)'))
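
One caveat with an exact-match join on gene name: the REPORTED GENE(S) field of the GWAS Catalog can list several genes in a single comma-separated entry. Assuming that is the case in obes_gwas.tsv, such entries would not match any single DisGeNET gene symbol. The following sketch, on toy tables with placeholder SNP IDs, shows how splitting those entries with tidyr's separate_rows() before joining could recover the matches. It is an illustration of the idea, not the procedure used for the results reported here.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for the two tables (hypothetical rows, placeholder SNP IDs)
dis_toy  <- tibble(Gene = c("FTO", "MC4R", "IRS1"))
gwas_toy <- tibble(
  `REPORTED GENE(S)` = c("FTO, RPGRIP1L", "MC4R"),
  SNPS               = c("rs0000001", "rs0000002")
)

# Exact-match join: "FTO" is not equal to "FTO, RPGRIP1L", so FTO gets no SNP
exact <- left_join(dis_toy, gwas_toy, by = c("Gene" = "REPORTED GENE(S)"))

# Split multi-gene entries into one row per gene, then join again
gwas_split <- separate_rows(gwas_toy, `REPORTED GENE(S)`, sep = ",\\s*")
split_join <- left_join(dis_toy, gwas_split, by = c("Gene" = "REPORTED GENE(S)"))

sum(!is.na(exact$SNPS))       # genes matched by the exact join
sum(!is.na(split_join$SNPS))  # genes matched after splitting
```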

The joined table contained 2018 observations and 22 variables. For most genes, no SNPs were available, so rows with no specified SNP (i.e., rows with NA in the SNPS column) were removed, leaving 113 observations.
final_table <- filter(joined, !is.na(SNPS))
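
As a sketch of an alternative, a left join followed by dropping unmatched rows gives the same rows as an inner join, provided the SNPS column is never NA in the GWAS table itself (an assumption). anti_join() can list the genes that would be discarded, so they can be inspected first. The tables below are toy data with hypothetical values.

```r
library(dplyr)

# Toy tables (hypothetical values, not the real data)
dis_toy  <- tibble(Gene = c("MC4R", "IRS1", "POMC"), Score_gda = c(0.9, 0.5, 0.6))
gwas_toy <- tibble(Gene = c("MC4R", "MC4R", "IRS1"), SNPS = c("rsA", "rsB", "rsC"))

# Approach used above: left join, then drop rows without a SNP
via_filter <- dis_toy %>%
  left_join(gwas_toy, by = "Gene") %>%
  filter(!is.na(SNPS))

# Alternative: keep only matching rows in a single step
via_inner <- inner_join(dis_toy, gwas_toy, by = "Gene")

# Genes with no GWAS match, i.e. the rows that get dropped
dropped <- anti_join(dis_toy, gwas_toy, by = "Gene")
```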

The final table was then exported into a .tsv file for further observation and analysis.

write.table(final_table, file='final_table.tsv', quote=FALSE, sep='\t', row.names = FALSE)

Attached Files

DisGeNET dataset for obesity (obes_gda.tsv)
GWAS Catalog dataset for obesity (obes_gwas.tsv)
Integrated table (final_table.tsv)

Results

We successfully integrated the two databases on gene name, the column common to both.

What are the genes with high GDA scores?

After filtering for genes with a GDA score higher than 0.4, we plotted the GDA scores against their respective genes. The genes with the highest GDA scores, in descending order, are MC4R, LEPR, and IRS1.

highgda <- filter(final_table, Score_gda > 0.4)
ggplot(highgda, aes(Gene, Score_gda, colour=Score_gda)) + geom_point() + ggtitle("GDA Score vs Gene")

Figure 2: Association between genes and obesity, measured by GDA score

Which chromosomes are those genes located on?

By using the same genes shown in the graph above, the chromosomal locations of those genes were visualized as follows:

ggplot(highgda, aes(CHR_ID, Gene, colour=Gene)) + geom_point() + ggtitle("Gene vs Chromosome Number")

Figure 3: Represents which chromosomes the genes are on

Do SNPs on those genes have significant p-values?

To determine whether the SNPs on those genes are indeed statistically significant, the negative log10 of their p-values (the PVALUE_MLOG column) was plotted against the corresponding gene names. A higher value means a smaller, more significant p-value:

ggplot(highgda, aes(Gene, PVALUE_MLOG, colour=PVALUE_MLOG)) + geom_point() + ggtitle("Log SNP P-values vs Corresponding Genes")

Figure 4: Visualizes the correlation between genes with high GDA scores and the p-values of corresponding SNPs
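
To make the scale of PVALUE_MLOG concrete, the sketch below converts p-values to -log10 and compares them with the conventional genome-wide significance threshold of 5e-8. The p-values are hypothetical and only illustrate the arithmetic. The threshold is a common GWAS convention, not one applied in the analysis above.

```r
# Conventional genome-wide significance threshold used in GWAS
p_threshold    <- 5e-8
mlog_threshold <- -log10(p_threshold)   # about 7.3

# Hypothetical p-values, for illustration only
p_values    <- c(1e-3, 1e-9, 2e-12)
pvalue_mlog <- -log10(p_values)

# TRUE where a SNP passes the threshold (larger -log10 = smaller p-value)
significant <- pvalue_mlog >= mlog_threshold
```

On this scale, a SNP plotted above roughly 7.3 in Figure 4 would count as genome-wide significant under that convention.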

Do SNPs on those genes also have high OR?

This further step was implemented to provide additional evidence that a gene is indeed highly associated with obesity.

# Rename the `OR or BETA` column (column 21 of the joined table) to OR
highgda <- rename(highgda, OR = `OR or BETA`)
ggplot(highgda, aes(Gene, OR, colour=OR)) + geom_point() + ggtitle("OR vs Gene")

Figure 5: Visualizes the relationship between OR of SNPs and their corresponding genes with high GDA scores

Discussion

What is GDA score?

GDA score stands for gene-disease association score; the higher the score, the stronger the evidence for the gene's association with a particular disease. The GDA score depends on the number of curated sources and publications that support that GDA. Curated sources (CGI, CLINGEN, GENOMICS ENGLAND, CTD, PSYGENET, ORPHANET, UNIPROT) carry the highest weight, contributing up to 0.6 to the score when at least two of them support a particular GDA. Further information on how the GDA score is calculated is available at http://www.disgenet.org/dbinfo

Significance of our analysis

Many genes show some association with obesity, as Figure 2 illustrates. From our observation, the gene most strongly associated with obesity is MC4R. Figure 3 shows which chromosome each of these genes lies on, which helps locate the obesity-associated genes in the genome. Figure 4 shows the SNPs reported for each gene and indicates their significance by the negative log10 of their p-values; the higher the value, the more significant the SNP. In the same graph, the FTO gene has a large number of significant SNPs overall. However, Figure 2 shows that FTO did not have a strong GDA score. Having multiple significant SNPs in a gene therefore does not by itself mean that the gene has a strong documented association with the disease. Our integrated table is thus useful for bridging gene-level disease association and the presence of SNPs in those genes. OR values offer a further analysis that can reinforce the significance observed for a particular gene. The odds ratio is the odds of having the disease given the presence of an allele divided by the odds of having the disease given the absence of that allele.
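
As a worked illustration of the odds ratio, the sketch below computes it from a 2x2 table of allele carriers and non-carriers among cases and controls. All counts are hypothetical and chosen only to show the arithmetic; they do not come from the GWAS Catalog.

```r
# Hypothetical 2x2 counts, for illustration only
cases_with_allele       <- 30
cases_without_allele    <- 70
controls_with_allele    <- 20
controls_without_allele <- 80

# Odds of carrying the allele among cases and among controls
odds_cases    <- cases_with_allele / cases_without_allele
odds_controls <- controls_with_allele / controls_without_allele

# Odds ratio: (30 * 80) / (70 * 20)
odds_ratio <- odds_cases / odds_controls
```

An odds ratio above 1 indicates that the allele is more common among cases than controls, while a value near 1 indicates little or no association.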

Further utilization of our integrated data table

The integrated data table opens up several further analyses. With it, the association between genes and obesity can be identified more efficiently, and assessing the association between SNPs and obesity becomes easier because several statistical scores are available side by side, so that GDA scores can be compared directly with p-values. We can also identify which specific SNPs in a given gene might contribute to obesity. The OR value adds another dimension to the analysis. With all three values, one could devise a strategy for assessing the association between a gene and a specific trait: a multidimensional analysis combining p-values, GDA scores, and OR values to indicate whether a gene is likely to be directly involved in a specific disease.
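
A minimal sketch of such a combined filter is shown below, using toy rows with hypothetical gene names and values. The GDA cut-off of 0.4 matches the one used in the Results; the p-value and OR cut-offs are illustrative assumptions. Note also that the GWAS Catalog column `OR or BETA` can hold either an odds ratio or a beta coefficient, so a real analysis would need to separate the two before filtering on OR.

```r
library(dplyr)

# Toy rows with the three scores (hypothetical names and values, not real results)
toy <- tibble(
  Gene        = c("GENE_A", "GENE_B", "GENE_C"),
  SNPS        = c("rs_a", "rs_b", "rs_c"),
  Score_gda   = c(0.7, 0.2, 0.6),
  PVALUE_MLOG = c(12.0, 15.0, 4.0),
  OR          = c(1.3, 1.5, 1.1)
)

# Keep rows that pass all three criteria; the cut-offs are assumptions
candidates <- toy %>%
  filter(Score_gda > 0.4,
         PVALUE_MLOG >= -log10(5e-8),
         OR > 1)

candidates$Gene   # genes supported by all three scores
```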

Challenges

In the early phase of data analysis, we almost used the joined table without first removing the empty rows. Those rows were empty because no SNPs were matched to certain genes. We learned that it is important to inspect integrated data carefully, as joining two datasets can produce empty rows. An alternative is to inspect the data tables separately and make the necessary changes before joining them.

Room for improvement

In conclusion, our integrated data table is rich with data, but much could still be improved. One improvement would be to integrate gene expression data into the existing table. The current data set tells us which genes are involved and where they are located in the genome, but not whether a gene is expressed or, if it is, at what level. Such information would provide greater context as to which SNPs lie in genes that are actually expressed. Another improvement would be to integrate information on co-expression. With this data set we implicitly treat each gene's contribution to obesity in isolation, whereas realistically many genes could contribute to it together.