Navigating through GDAP (Gene-Disease Association Prediction)
The GDAP project follows a modular structure where each component handles a specific part of the gene-disease association prediction pipeline. Here’s how to navigate through the main components:
Data Input
Initialize the pipeline using the Config class. Set parameters such as:
- DATA_SOURCE: Choose between "GraphQLClient" or "BigQueryClient"
- DISEASE_ID: Specify the disease ID (EFO format)
- PPI_INTERACTIONS: Maximum number of PPI interactions to consider
- NEGATIVE_TO_POSITIVE_RATIO: Ratio of negative to positive edges
Load the data using the load_data() function:
config = Config()
pipeline = GeneDisease(config)
ppi_df, ot_df = pipeline.load_data()
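As a rough illustration of the settings listed in this section, the sketch below models them as a small dataclass with basic validation. ExampleConfig, its default values, and the validation rules are hypothetical and are not GDAP's actual Config class; only the four parameter names and the two data source names come from the text above.

```python
from dataclasses import dataclass

# The two data source names mentioned in the text above.
ALLOWED_SOURCES = ("GraphQLClient", "BigQueryClient")


@dataclass
class ExampleConfig:
    """Hypothetical stand-in for GDAP's Config; every default is a placeholder."""

    DATA_SOURCE: str = "GraphQLClient"
    DISEASE_ID: str = "EFO_0000000"       # placeholder ID, not a real disease
    PPI_INTERACTIONS: int = 1000          # assumed value
    NEGATIVE_TO_POSITIVE_RATIO: int = 1   # assumed value

    def __post_init__(self):
        # Assumed sanity checks; the real Config may validate differently or not at all.
        if self.DATA_SOURCE not in ALLOWED_SOURCES:
            raise ValueError(f"DATA_SOURCE must be one of {ALLOWED_SOURCES}")
        if not self.DISEASE_ID.startswith("EFO_"):
            raise ValueError("DISEASE_ID is expected in EFO format (EFO_...)")
        if self.PPI_INTERACTIONS <= 0:
            raise ValueError("PPI_INTERACTIONS must be positive")
        if self.NEGATIVE_TO_POSITIVE_RATIO <= 0:
            raise ValueError("NEGATIVE_TO_POSITIVE_RATIO must be positive")


example = ExampleConfig(DATA_SOURCE="BigQueryClient", PPI_INTERACTIONS=500)
```

Validating at construction time means a mistyped data source fails immediately rather than partway through data loading.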
Graph Construction
After loading the data, the graph construction step integrates disease-gene associations and PPI data into a bipartite graph using the create_graph() function:
pipeline.create_graph(ppi_df, ot_df)
This creates:
- pipeline.G: The main graph object
- pipeline.pos_edges: Positive edge connections (disease-gene associations)
- pipeline.neg_edges: Negative edge connections (PPI network)
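To make the three outputs concrete, here is a minimal pure-Python sketch of the idea. It is not GDAP's create_graph(). The function name build_graph, the adjacency-dict representation, the toy node names, and the rule of capping negatives at ratio times the number of positives are all assumptions for illustration.

```python
import random
from collections import defaultdict


def build_graph(disease_gene_pairs, ppi_pairs, neg_to_pos_ratio=1, seed=0):
    """Return (adjacency, pos_edges, neg_edges) from two lists of node pairs."""
    # Positive edges: unique disease-gene associations.
    pos_edges = sorted(set(disease_gene_pairs))
    # PPI edges are undirected, so (A, B) and (B, A) count as one edge.
    ppi_unique = sorted({tuple(sorted(pair)) for pair in ppi_pairs})

    # Assumption: sample at most ratio * len(pos_edges) PPI edges as negatives.
    max_neg = min(len(ppi_unique), int(neg_to_pos_ratio * len(pos_edges)))
    neg_edges = random.Random(seed).sample(ppi_unique, max_neg)

    # The graph itself keeps every edge from both sources.
    adjacency = defaultdict(set)
    for u, v in pos_edges + ppi_unique:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency), pos_edges, neg_edges


# Toy data with made-up identifiers.
disease_gene = [("EFO_X", "GENE_A"), ("EFO_X", "GENE_B")]
ppi = [("GENE_A", "GENE_C"), ("GENE_C", "GENE_A"),
       ("GENE_B", "GENE_D"), ("GENE_C", "GENE_D")]
adjacency, pos_edges, neg_edges = build_graph(disease_gene, ppi, neg_to_pos_ratio=1)
```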
Embedding Generation
Embedding generation is a crucial step in GDAP that significantly impacts the quality of predictions. GDAP supports multiple embedding methods, which are configured in the Config class and run with generate_embeddings():
- Simple node embedding based on degree
  - Default Parameters:
    - Dimension: 64
- Node2Vec (Node2Vec)
  - Default Parameters:
    - Walk length: 10
    - Components: 32
  - Output: Embedding model n2v_model and vector embeddings n2v_wheel_model.bin
- ProNE (ProNE)
  - Default Parameters:
    - Components: 32
    - Parameters: step=5, mu=0.2, theta=0.5
  - Output: Embedding model prone_model and vector embeddings prone_wheel_model.bin
- GGVec (GGVec)
  - Default Parameters:
    - Components: 64
    - Order: 3
  - Output: Embedding model ggvec_model and vector embeddings ggvec_wheel_model.bin
The quality of these embeddings directly affects GDAP’s ability to capture the complex relationships between genes and diseases, making this step vital for accurate predictions.
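For intuition about what a walk-based method such as Node2Vec consumes, the sketch below generates uniform random walks over a small adjacency dict. This is only the unbiased special case of Node2Vec's walk sampling, with no return or in-out bias, and it stops before the step that trains vectors on the walks. The function name random_walks, the walks_per_node parameter, and the toy graph are assumptions, not GDAP code.

```python
import random


def random_walks(adjacency, walk_length=10, walks_per_node=2, seed=0):
    """Generate uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(adjacency):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = sorted(adjacency[walk[-1]])
                if not neighbours:
                    break  # isolated node: the walk cannot continue
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Toy undirected graph with made-up node names.
toy_graph = {
    "EFO_X": {"GENE_A", "GENE_B"},
    "GENE_A": {"EFO_X", "GENE_C"},
    "GENE_B": {"EFO_X"},
    "GENE_C": {"GENE_A"},
}
walks = random_walks(toy_graph, walk_length=10)
```

Each walk is a sequence of node names, which an embedding method can treat like a sentence of tokens.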
Feature Extraction
The feature extraction process converts node embeddings into edge features using extract_features_labels():
(
    X_train, y_train, X_val, y_val, X_test, y_test,
    edges_train, edges_val, edges_test,
) = pipeline.extract_features_labels(config.TEST_SIZE)
This step:
- Maps nodes to indices for embedding lookup
- Generates feature matrices and labels
- Splits data into training, validation, and test sets
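The three steps above can be sketched in plain Python. This is an illustration under stated assumptions, not GDAP's extract_features_labels(): the element-wise (Hadamard) product is one common way to combine two node vectors into an edge feature, and the source does not say which operator GDAP uses. The helper names, the val_size parameter, and the toy data are hypothetical.

```python
import random


def edge_features(edges, embeddings, node_index):
    """Element-wise product of the two endpoint embeddings for each edge."""
    features = []
    for u, v in edges:
        eu, ev = embeddings[node_index[u]], embeddings[node_index[v]]
        features.append([a * b for a, b in zip(eu, ev)])
    return features


def split_three_way(X, y, edges, test_size=0.2, val_size=0.2, seed=0):
    """Shuffle once, then cut into test, validation, and training parts."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    n_test = int(len(order) * test_size)
    n_val = int(len(order) * val_size)
    parts = {
        "test": order[:n_test],
        "val": order[n_test:n_test + n_val],
        "train": order[n_test + n_val:],
    }
    return {
        name: ([X[i] for i in idx], [y[i] for i in idx], [edges[i] for i in idx])
        for name, idx in parts.items()
    }


# Toy data: 2-dimensional embeddings for four made-up nodes.
node_index = {"EFO_X": 0, "GENE_A": 1, "GENE_B": 2, "GENE_C": 3}
embeddings = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5], [2.0, 0.0]]
pos = [("EFO_X", "GENE_A"), ("EFO_X", "GENE_B")]
neg = [("GENE_A", "GENE_C"), ("GENE_B", "GENE_C"), ("GENE_A", "GENE_B")]
all_edges = pos + neg
X = edge_features(all_edges, embeddings, node_index)
y = [1] * len(pos) + [0] * len(neg)
splits = split_three_way(X, y, all_edges, test_size=0.2, val_size=0.2)
```

Keeping the edge list aligned with each split makes it possible to map a prediction back to the node pair it refers to.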
Model Selection
The model selection and training process is handled by train_and_evaluate_model():
model, models_path = pipeline.train_and_evaluate_model(X_train, y_train, X_test, y_test, X_val, y_val)
This step:
- Trains the selected classification model
- Evaluates performance using multiple metrics:
  - Accuracy
  - Precision
  - Recall
  - F1 Score
- Saves the trained model to disk
Model evaluation plots can be generated using:
pipeline.plot_model_evaluation(model, X_val, y_val, models_path)
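For reference, the four metrics named in this section can be computed from binary labels as follows. This is a self-contained sketch of the standard definitions, not the code GDAP runs internally; the function name classification_metrics is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = classification_metrics([1, 1, 0, 0], [1, 0, 0, 1])
```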
Model Prediction
Finally, predictions are made using predict_and_save_results():
associated_df, non_associated_df = pipeline.predict_and_save_results(model, X_val, edges_val, models_path)
This generates:
- associated_df: Predicted positive associations
- non_associated_df: Predicted negative associations
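The split into the two result sets can be pictured as a threshold on the classifier's predicted probability. The sketch below assumes a cutoff of 0.5 and a simple list-of-dicts layout. The source does not state either, so the threshold, the column names, and the function name split_predictions are hypothetical.

```python
def split_predictions(edges, probabilities, threshold=0.5):
    """Partition edges into predicted-associated and predicted-non-associated rows."""
    associated, non_associated = [], []
    for (source, target), prob in zip(edges, probabilities):
        row = {"source": source, "target": target, "probability": prob}
        if prob >= threshold:
            associated.append(row)
        else:
            non_associated.append(row)
    return associated, non_associated


# Toy edges and scores with made-up identifiers.
toy_edges = [("EFO_X", "GENE_A"), ("EFO_X", "GENE_B"), ("EFO_X", "GENE_C")]
toy_scores = [0.91, 0.12, 0.50]
associated, non_associated = split_predictions(toy_edges, toy_scores)
```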
The results are saved in the following directory structure:
output_dir/
└── disease_name/
    ├── embedding_wheel/
    │   ├── n2v_model
    │   ├── n2v_wheel_model.bin
    │   └── ...
    ├── network/
    │   ├── graph.graphml
    │   ├── negative_edges.npy
    │   ├── positive_edges.npy
    │   └── ...
    ├── predictions/
    │   ├── associated_prediction_results.csv
    │   └── non_associated_results.csv
    └── classifier_model/
        └── model_name.pkl
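When post-processing results in your own scripts, it can help to build these paths in one place. The sketch below creates the four subdirectories shown in the layout with pathlib. The helper make_output_dirs is hypothetical and not part of GDAP, and the example runs in a temporary directory so that it touches no real output.

```python
import tempfile
from pathlib import Path

# Subdirectory names taken from the layout shown above.
SUBDIRS = ("embedding_wheel", "network", "predictions", "classifier_model")


def make_output_dirs(output_dir, disease_name):
    """Create output_dir/disease_name/<subdir> for each subdir and return the paths."""
    root = Path(output_dir) / disease_name
    paths = {name: root / name for name in SUBDIRS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    paths = make_output_dirs(tmp, "disease_name")
    created = sorted(p.name for p in (Path(tmp) / "disease_name").iterdir())
    predictions_csv = paths["predictions"] / "associated_prediction_results.csv"
```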