# Configuration

GDAP's behavior is controlled through the `src/config.py` file. This guide explains how to customize the pipeline parameters.
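The settings are plain class attributes, so other pipeline modules can read them directly. A minimal usage sketch, assuming `src/` is importable from the project root (the import path is an assumption based on `src/config.py`, not something this guide specifies):

```python
# Usage sketch: read configuration values as class attributes.
from src.config import Config

print(Config.DISEASE_ID)      # "EFO_0000319"
print(Config.EMBEDDING_MODE)  # "GGVec"
```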
## Configuration File

```python
class Config:
    DISEASE_ID = "EFO_0000319"          # Cardiovascular disease
    DATA_SOURCE = "GraphQLClient"       # GraphQLClient or BigQueryClient
    QUERY = INDIRECT_SCORES             # DIRECT_SCORES or INDIRECT_SCORES
    PPI_INTERACTIONS = 5000000          # Max PPI interactions
    NEGATIVE_TO_POSITIVE_RATIO = 10     # Negative/positive sample ratio
    EMBEDDING_MODE = "GGVec"            # Node2Vec, ProNE, GGVec, degree_avg
    MODEL_NAME = "Logistic_Regression"  # ML model to use
    TEST_SIZE = 0.2                     # Train/test split ratio
    OUTPUT_DIR = PROJ_ROOT / "results/"
    CREDENTIALS = PROJ_ROOT / "configs/stemaway-d5a18133ff83.json"
```
## Key Parameters

### Disease Selection

`DISEASE_ID`: EFO (Experimental Factor Ontology) identifier for your target disease.
```python
# Examples
DISEASE_ID = "EFO_0000319"    # Cardiovascular disease
DISEASE_ID = "MONDO_0004979"  # Asthma
```
Find disease identifiers on the Open Targets Platform (https://platform.opentargets.org); you can search by disease name or by ontology ID.
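You can also look IDs up programmatically through the public Open Targets GraphQL API. The sketch below follows the published schema (a `search` query with `queryString` and `entityNames`), but verify the field names against the current API before relying on it:

```python
# Hedged sketch: query the public Open Targets GraphQL API for disease IDs.
import requests

QUERY = """
query DiseaseSearch($text: String!) {
  search(queryString: $text, entityNames: ["disease"]) {
    hits { id name entity }
  }
}
"""

resp = requests.post(
    "https://api.platform.opentargets.org/api/v4/graphql",
    json={"query": QUERY, "variables": {"text": "asthma"}},
    timeout=30,
)
for hit in resp.json()["data"]["search"]["hits"]:
    print(hit["id"], hit["name"])
```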
### Data Source Configuration

`DATA_SOURCE`: Choose between the GraphQL API and BigQuery.
```python
# GraphQL API (default, no credentials needed)
DATA_SOURCE = "GraphQLClient"

# BigQuery (requires Google Cloud credentials)
DATA_SOURCE = "BigQueryClient"
```
`QUERY`: For BigQuery, choose the query type:

- `DIRECT_SCORES`: Direct disease-target associations
- `INDIRECT_SCORES`: Indirect associations with evidence
### Embedding Configuration

`EMBEDDING_MODE`: Choose the embedding algorithm:
EMBEDDING_MODE = "GGVec" # Graph-based global vectors (default)
EMBEDDING_MODE = "Node2Vec" # Random walk-based embeddings
EMBEDDING_MODE = "ProNE" # Proximity-preserving embeddings
EMBEDDING_MODE = "degree_avg" # Simple degree-based features
### Model Configuration

`MODEL_NAME`: Choose the machine learning model:
MODEL_NAME = "Logistic_Regression" # Linear classifier (default)
MODEL_NAME = "Random_Forest" # Ensemble tree model
MODEL_NAME = "SVM" # Support Vector Machine
MODEL_NAME = "Gradient_Boosting" # Gradient boosting trees
### Data Processing

`PPI_INTERACTIONS`: Maximum number of protein-protein interactions to load:
```python
PPI_INTERACTIONS = 5000000  # 5M interactions (default)
PPI_INTERACTIONS = 1000000  # 1M interactions (faster, less memory)
```
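One way to picture this cap, assuming the PPI network arrives as a tabular edge list (the file name and format below are placeholders, not GDAP's actual paths):

```python
# Hedged sketch: cap the number of edge rows read from a PPI edge-list file.
import pandas as pd

edges = pd.read_csv(
    "ppi_interactions.tsv",  # placeholder path
    sep="\t",
    nrows=5_000_000,         # mirrors Config.PPI_INTERACTIONS
)
```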
`NEGATIVE_TO_POSITIVE_RATIO`: Balance between positive and negative samples:
```python
NEGATIVE_TO_POSITIVE_RATIO = 10  # 10:1 negative to positive (default)
NEGATIVE_TO_POSITIVE_RATIO = 5   # 5:1 ratio
```
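Conceptually, for every known positive gene the pipeline keeps several unlabeled genes as negatives. A small illustrative sketch (not GDAP's exact sampling code):

```python
# Illustrative ratio-based negative sampling; gene names are toy values.
import random

def sample_negatives(positives, candidates, ratio=10, seed=0):
    """Pick ratio * len(positives) genes that are not in the positive set."""
    pool = [g for g in candidates if g not in set(positives)]
    random.Random(seed).shuffle(pool)
    return pool[: ratio * len(positives)]

negatives = sample_negatives(
    positives=["GENE_A", "GENE_B"],
    candidates=[f"GENE_{i}" for i in range(1000)],
    ratio=10,
)
print(len(negatives))  # 20
```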
`TEST_SIZE`: Train/test split ratio:
```python
TEST_SIZE = 0.2  # 20% for testing (default)
TEST_SIZE = 0.3  # 30% for testing
```
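This value behaves like scikit-learn's `test_size` fraction. A self-contained toy example (the `stratify` argument is just a sensible default here, not necessarily what GDAP uses):

```python
# Toy split showing how TEST_SIZE maps onto scikit-learn's test_size argument.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 8)          # toy embedding matrix
y = np.array([0, 1] * 50)           # toy labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 8) (20, 8)
```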
## Example Configurations

### Quick Test Configuration

```python
class Config:
    DISEASE_ID = "EFO_0000319"
    DATA_SOURCE = "GraphQLClient"
    PPI_INTERACTIONS = 1000000     # Reduced for faster processing
    EMBEDDING_MODE = "degree_avg"  # Simple embeddings
    MODEL_NAME = "Logistic_Regression"
    TEST_SIZE = 0.2
```
### High-Performance Configuration

```python
class Config:
    DISEASE_ID = "EFO_0000319"
    DATA_SOURCE = "BigQueryClient"  # For large datasets
    QUERY = INDIRECT_SCORES
    PPI_INTERACTIONS = 10000000     # More interactions
    EMBEDDING_MODE = "GGVec"
    MODEL_NAME = "Random_Forest"
    NEGATIVE_TO_POSITIVE_RATIO = 15
```
## BigQuery Setup

If using BigQuery, you need Google Cloud credentials:

- Create a Google Cloud project
- Enable the BigQuery API
- Create a service account and download a JSON key
- Place the JSON file in the `configs/` directory
- Update the `CREDENTIALS` path in `config.py`
## Environment Variables

You can also use environment variables for sensitive configuration:
```bash
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
export DISEASE_ID="EFO_0000319"
```
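On the Python side, such values are typically read back with `os.environ`. A brief hedged sketch (whether GDAP itself reads `DISEASE_ID` from the environment is an assumption):

```python
# Hedged sketch: read the exported variables back inside Python.
import os

disease_id = os.environ.get("DISEASE_ID", "EFO_0000319")              # fallback to default
credentials_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")   # may be None
```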
## Google Cloud Setup

- Create a service account in Google Cloud with the necessary roles (BigQuery User):
    - Go to the Google Cloud Console.
    - Select your project or create a new one if needed.
    - In the navigation menu, go to IAM & Admin > Service Accounts.
    - Create a new service account or use an existing one.
    - Grant the necessary roles (e.g., BigQuery User) to this service account.
    - Create a key for this service account in JSON format.
- Download the service account JSON key.
    - After creating the service account and generating the key, download the JSON key file to your local machine. This file will be used for authentication.
- Set up Google Cloud credentials locally, using a `.env` file or a direct path (see Option 1 and Option 2 below).
**Option 1: Using a `.env` file**

- Create a `.env` file in your project's root directory (if it doesn't already exist) and add the full path to your service account JSON key:

```
GOOGLE_APPLICATION_CREDENTIALS=/path/to/json/file
```

- Install the `python-dotenv` package if it isn't already installed:

```bash
pip install python-dotenv
```
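With the `.env` file in place, it can be loaded at runtime with the generic python-dotenv pattern shown below (how GDAP wires this up internally is not shown in this guide):

```python
# Generic python-dotenv pattern: load .env, then read the credentials path.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory by default
credentials_path = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
print(credentials_path)
```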
**Option 2: Using the `GOOGLE_APPLICATION_CREDENTIALS` environment variable**

You can set the variable in your terminal, or provide the `credentials_path` directly when initializing the client. To manage the environment variable from the terminal, run:

```bash
export GOOGLE_APPLICATION_CREDENTIALS='/path/to/json/file'
echo $GOOGLE_APPLICATION_CREDENTIALS    # Verify it
unset GOOGLE_APPLICATION_CREDENTIALS    # Unset it
```

Alternatively, provide the credentials path directly in your code when initializing the BigQuery client.
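For the direct-path route, the standard google-cloud-bigquery pattern looks like the sketch below; how GDAP's `BigQueryClient` wraps this internally is not shown here, so treat the wiring as an assumption:

```python
# Standard google-cloud-bigquery pattern for explicit service-account credentials.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/json/file"  # same key referenced by Config.CREDENTIALS
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
```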
- Run the main file to fetch the desired disease dataset
## Deployment

- Add the key to your local app secrets file, `.streamlit/secrets.toml`:
```toml
# .streamlit/secrets.toml
[gcp_service_account]
type = "service_account"
project_id = "xxx"
private_key_id = "xxx"
private_key = "xxx"
client_email = "xxx"
client_id = "xxx"
auth_uri = "https://accounts.google.com/o/oauth2/auth"
token_uri = "https://oauth2.googleapis.com/token"
auth_provider_x509_cert_url = "https://www.googleapis.com/oauth2/v1/certs"
client_x509_cert_url = "xxx"
```
- Copy your app secrets to the cloud: go to the app dashboard and, in the app's dropdown menu, click on Edit Secrets. Copy the contents of `secrets.toml` into the text area (the deployed app can then rebuild its credentials from `st.secrets`, as sketched below).
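A hedged sketch of how the deployed app can build BigQuery credentials from those secrets; this follows the standard Streamlit plus google-auth pattern, and GDAP's actual app code may differ:

```python
# Standard Streamlit pattern: rebuild service-account credentials from st.secrets.
import streamlit as st
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_info(
    st.secrets["gcp_service_account"]
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
```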