Navigating through GDAP (Gene-Disease Association Prediction)

Table of contents

Data Input
Graph Construction
Embedding Generation
Feature Extraction
Model Selection
Model Prediction

The GDAP project follows a modular structure where each component handles a specific part of the gene-disease association prediction pipeline. Here’s how to navigate through the main components:

Data Input

Initialize the pipeline using the Config class.
Set parameters such as:
- DATA_SOURCE: Choose between “GraphQLClient” or “BigQueryClient”
- DISEASE_ID: Specify the disease ID (EFO format)
- PPI_INTERACTIONS: Maximum number of PPI interactions to consider
- NEGATIVE_TO_POSITIVE_RATIO: Ratio of negative to positive edges
Load the data using the load_data() function:

  config = Config()
  pipeline = GeneDisease(config)
  ppi_df, ot_df = pipeline.load_data()
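
The parameters listed above can be pictured as plain attributes on a configuration object. The sketch below is a hypothetical stand-in, not GDAP's actual Config class: the field names come from the list above, while the default values, the example disease ID, and the validate() helper are assumptions made for illustration.

```python
from dataclasses import dataclass

# Allowed values taken from the DATA_SOURCE description above.
ALLOWED_SOURCES = {"GraphQLClient", "BigQueryClient"}


@dataclass
class ExampleConfig:
    """Hypothetical stand-in for GDAP's Config; all defaults are illustrative."""

    DATA_SOURCE: str = "GraphQLClient"
    DISEASE_ID: str = "EFO_0000319"        # example EFO-format ID (assumed)
    PPI_INTERACTIONS: int = 5000           # assumed cap on PPI interactions
    NEGATIVE_TO_POSITIVE_RATIO: int = 10   # assumed negatives per positive edge

    def validate(self) -> None:
        """Raise ValueError if a setting is outside the documented options."""
        if self.DATA_SOURCE not in ALLOWED_SOURCES:
            raise ValueError(f"unknown DATA_SOURCE: {self.DATA_SOURCE!r}")
        if not self.DISEASE_ID.startswith("EFO_"):
            raise ValueError("DISEASE_ID is expected in EFO format")
        if self.PPI_INTERACTIONS <= 0 or self.NEGATIVE_TO_POSITIVE_RATIO <= 0:
            raise ValueError("numeric settings must be positive")


cfg = ExampleConfig(NEGATIVE_TO_POSITIVE_RATIO=5)
cfg.validate()
```

Checking settings before any data is fetched keeps a misspelled client name from surfacing later as a failed download.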

Graph Construction

After the data is loaded, the graph construction step integrates disease-gene associations and PPI data into a bipartite graph using the create_graph() function:

  pipeline.create_graph(ppi_df, ot_df)

This creates:
- pipeline.G: The main graph object
- pipeline.pos_edges: Positive edge connections (disease-gene associations)
- pipeline.neg_edges: Negative edge connections (PPI network)
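
How create_graph() selects negative edges is not spelled out here, so the following is only a rough sketch under one assumption: negatives are disease-gene pairs drawn from PPI genes that have no known association with the disease, sampled at the configured negative-to-positive ratio. The function name build_edges and the dict-of-sets graph are illustrative and are not GDAP's internals.

```python
import random


def build_edges(disease_id, associated_genes, ppi_pairs, neg_ratio, seed=0):
    """Return (adjacency, pos_edges, neg_edges) for a single disease."""
    adjacency = {}

    def add_edge(u, v):
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    # Gene-gene edges from the PPI table.
    for gene_a, gene_b in ppi_pairs:
        add_edge(gene_a, gene_b)

    # Positive edges: known disease-gene associations.
    pos_edges = [(disease_id, g) for g in sorted(set(associated_genes))]
    for u, v in pos_edges:
        add_edge(u, v)

    # Assumed negative candidates: PPI genes without a known association.
    candidates = sorted(set(adjacency) - set(associated_genes) - {disease_id})
    n_neg = min(len(candidates), neg_ratio * len(pos_edges))
    neg_edges = [(disease_id, g)
                 for g in random.Random(seed).sample(candidates, n_neg)]
    return adjacency, pos_edges, neg_edges


adjacency, pos_edges, neg_edges = build_edges(
    "EFO_X",
    ["G1", "G2"],
    [("G1", "G3"), ("G3", "G4"), ("G2", "G5"), ("G5", "G6")],
    neg_ratio=2,
)
```

Negative edges are returned as labels only and are not added to the graph, so they cannot leak into the embeddings.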

Embedding Generation

Embedding generation is a crucial step in GDAP because the embeddings are the only input the downstream classifier sees. Through generate_embeddings(), GDAP supports several embedding methods, which are configured in the Config class:

Simple node embedding based on degree
- Default Parameters:
  - Dimension: 64

Node2Vec (Node2Vec)
- Default Parameters:
  - Walk length: 10
  - Components: 32
- Output: Embedding model n2v_model and vector embeddings n2v_wheel_model.bin

ProNE (ProNE)
- Default Parameters:
  - Components: 32
  - Parameters: step=5, mu=0.2, theta=0.5
- Output: Embedding model prone_model and vector embeddings prone_wheel_model.bin

GGVec (GGVec)
- Default Parameters:
  - Components: 64
  - Order: 3
- Output: Embedding model ggvec_model and vector embeddings ggvec_wheel_model.bin

The quality of these embeddings directly affects GDAP’s ability to capture the complex relationships between genes and diseases, making this step vital for accurate predictions.
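
To make the Node2Vec "walk length" parameter concrete, here is a minimal sketch of the random walks such methods learn from. It uses uniform walks, which correspond to the unbiased case of Node2Vec (p = q = 1), and is not the implementation GDAP calls. The walks_per_node parameter and the dict-of-sets graph format are assumptions.

```python
import random


def random_walks(adjacency, walk_length=10, walks_per_node=5, seed=0):
    """Generate uniform random walks of at most walk_length nodes each."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(adjacency):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = sorted(adjacency[walk[-1]])
                if not neighbors:
                    break  # isolated node: the walk cannot continue
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


toy_graph = {"EFO_X": {"G1"}, "G1": {"EFO_X", "G2"}, "G2": {"G1"}}
walks = random_walks(toy_graph, walk_length=10, walks_per_node=2)
```

In Node2Vec, walks like these are treated as sentences for a skip-gram model, so nodes that co-occur on many walks end up with similar vectors.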

Feature Extraction

The feature extraction process converts node embeddings into edge features using extract_features_labels():

  X_train, y_train, X_val, y_val, X_test, y_test, edges_train, edges_val, edges_test = pipeline.extract_features_labels(config.TEST_SIZE)

This step:

- Maps nodes to indices for embedding lookup
- Generates feature matrices and labels
- Splits data into training, validation, and test sets
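
The three steps above can be sketched as follows. The edge operator GDAP applies is not stated here, so the example assumes the element-wise (Hadamard) product of the two node vectors, a common choice for link prediction. The helper names and the val_size argument are hypothetical.

```python
import random


def edge_features(edges, labels, node_index, embeddings):
    """Build one feature vector per edge from its two node embeddings."""
    X = []
    for u, v in edges:
        emb_u = embeddings[node_index[u]]   # node -> row index -> vector
        emb_v = embeddings[node_index[v]]
        X.append([a * b for a, b in zip(emb_u, emb_v)])  # Hadamard product
    return X, list(labels)


def split_indices(n, test_size, val_size, seed=0):
    """Shuffle range(n) and return (train, val, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_size))
    n_val = int(round(n * val_size))
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]


node_index = {"EFO_X": 0, "G1": 1, "G2": 2}
embeddings = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.0]]
X, y = edge_features([("EFO_X", "G1"), ("EFO_X", "G2")], [1, 0],
                     node_index, embeddings)
train_idx, val_idx, test_idx = split_indices(len(X), 0.5, 0.0)
```

Splitting by index rather than by value keeps each feature row aligned with its edge, which is what allows predictions to be traced back to gene names later.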

Model Selection

The model selection and training process is handled by train_and_evaluate_model():

  model, models_path = pipeline.train_and_evaluate_model(X_train, y_train, X_test, y_test, X_val, y_val)

This step:

- Trains the selected classification model
- Evaluates performance using multiple metrics:
  - Accuracy
  - Precision
  - Recall
  - F1 Score
- Saves the trained model to disk
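
For reference, the four metrics listed above follow directly from the confusion-matrix counts. The function below is an illustrative sketch for binary labels (1 = associated, 0 = not associated), not the code GDAP runs internally.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

When negatives outnumber positives, accuracy alone can look high for a model that predicts "not associated" everywhere, so precision, recall, and F1 are the more informative figures.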

Model evaluation plots can be generated using:

  pipeline.plot_model_evaluation(model, X_val, y_val, models_path)

Model Prediction

Finally, predictions are made using predict_and_save_results():

  associated_df, non_associated_df = pipeline.predict_and_save_results(model, X_val, edges_val, models_path)

This generates:

- associated_df: Predicted positive associations
- non_associated_df: Predicted negative associations
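
The split into the two result sets can be pictured as thresholding the classifier's scores. This sketch assumes a probability cutoff of 0.5 and uses plain lists of dicts in place of DataFrames; the column names are hypothetical and may differ from what predict_and_save_results() writes.

```python
def split_predictions(edges, probabilities, threshold=0.5):
    """Partition scored edges into predicted positives and negatives."""
    associated, non_associated = [], []
    for (disease, gene), prob in zip(edges, probabilities):
        row = {"disease": disease, "gene": gene, "probability": prob}
        if prob >= threshold:
            associated.append(row)
        else:
            non_associated.append(row)
    # Highest-scoring candidates first in both result sets.
    associated.sort(key=lambda r: r["probability"], reverse=True)
    non_associated.sort(key=lambda r: r["probability"], reverse=True)
    return associated, non_associated


associated, non_associated = split_predictions(
    [("EFO_X", "G1"), ("EFO_X", "G2"), ("EFO_X", "G3")],
    [0.2, 0.9, 0.6],
)
```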

The results are saved in the following directory structure:

  output_dir/
  └── disease_name/
      ├── embedding_wheel/
      │   ├── n2v_model
      │   ├── n2v_wheel_model.bin
      │   └── ...
      ├── network/
      │   ├── graph.graphml
      │   ├── negative_edges.npy
      │   ├── positive_edges.npy
      │   └── ...
      ├── predictions/
      │   ├── associated_prediction_results.csv
      │   └── non_associated_results.csv
      └── classifier_model/
          └── model_name.pkl
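
To locate these files from a script, the layout above can be mirrored with pathlib. The helper below is hypothetical and only rebuilds the paths shown in the tree; output_dir, disease_name, and model_name are placeholders to be supplied by the caller.

```python
from pathlib import Path


def output_paths(output_dir, disease_name, model_name):
    """Return the expected result locations for one disease run."""
    root = Path(output_dir) / disease_name
    return {
        "embeddings": root / "embedding_wheel",
        "graph": root / "network" / "graph.graphml",
        "positive_edges": root / "network" / "positive_edges.npy",
        "negative_edges": root / "network" / "negative_edges.npy",
        "associated": root / "predictions" / "associated_prediction_results.csv",
        "non_associated": root / "predictions" / "non_associated_results.csv",
        "model": root / "classifier_model" / f"{model_name}.pkl",
    }


paths = output_paths("output_dir", "disease_name", "model_name")
```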