# FScanpy Commit Code Architecture
## Code Manifest

- `data_pre/`: `mfe.py`, `seaphage_knn.py`
- `model_feature/`: `feature_analysis.py`
- `train_models/`: `bilstm_cnn.py`, `hist_gb.py`
- `utils/`: `config.py`, `function.py`
## Overview & Pipeline

This repository is organized around one main pipeline: read and preprocess sequence data → construct features and train models → analyze importance and save results.

- Data Preprocessing (`data_pre/`): calculates and writes back structural energy features (MFE), and filters samples based on external confidence scores and KNN distance.
- Model Training (`train_models/`):
  - The traditional Gradient Boosting model (`HistGradientBoosting`) uses explicit features (mono-/di-/tri-nucleotides + MFE).
  - The deep learning model (BiLSTM-CNN) learns representations end-to-end from encoded sequences and supports iterative self-training.
- Feature Importance Analysis (`model_feature/`):
  - For the GB model: Permutation Importance and SHAP analysis.
  - For the BiLSTM-CNN model: Integrated Gradients and Saliency Map analysis.
- Common Configuration and Utilities (`utils/`): general-purpose functions for path/directory configuration, data loading, evaluation, and results saving.
## data_pre/
### data_pre/mfe.py

- Purpose:
  - Calls ViennaRNA (`import RNA`) to calculate the Minimum Free Energy (MFE) of given sequence windows and writes the results back to a CSV file.
- Logic:
  - For each `full_seq`, crops subsequences according to the configuration (default start: 198; lengths: 40 and 120).
  - Uses `RNA.fold_compound(sub_seq).mfe()` to calculate the MFE and populates the specified columns (`mfe_40bp`, `mfe_120bp`).
- Key Functions:
  - `predict_rna_structure_multiple(csv_file, predictions)`: writes back columns for multiple window configurations.
  - `calculate_mfe_features(data_file, output_file=None)`: the standard entry point that assembles the window predictions and calls the function above.
- Input/Output:
  - Input: a CSV file containing `full_seq` (e.g., `BaseConfig.VALIDATION_DATA`).
  - Output: appends two columns, `mfe_40bp` and `mfe_120bp`, to the original CSV and saves it.
- Dependencies:
  - ViennaRNA Python interface (`RNA`), `pandas`.
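A minimal sketch of the write-back step, assuming the default window configuration above (start 198, lengths 40 and 120). The inner loop mirrors the documented `RNA.fold_compound(...).mfe()` call, but the helper name and I/O handling are illustrative:

```python
import pandas as pd
import RNA  # ViennaRNA Python interface

# Window configuration: column name -> (start offset, window length).
WINDOWS = {"mfe_40bp": (198, 40), "mfe_120bp": (198, 120)}

def add_mfe_columns(csv_file, output_file=None):
    df = pd.read_csv(csv_file)
    for col, (start, length) in WINDOWS.items():
        mfes = []
        for seq in df["full_seq"]:
            sub_seq = seq[start:start + length]
            # mfe() returns (dot-bracket structure, free energy in kcal/mol).
            _, mfe = RNA.fold_compound(sub_seq).mfe()
            mfes.append(mfe)
        df[col] = mfes
    df.to_csv(output_file or csv_file, index=False)
    return df
```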
### data_pre/seaphage_knn.py

- Purpose:
  - Selects medium/low-confidence neighbor samples based on externally provided confidence rankings for the `SEAPHAGES` subset (`final_rank` → `confidence`) and on distances between one-hot encoded sequences. These are then merged with the high-confidence samples to create an augmented training set.
- Logic:
  - Reads `BaseConfig.TRAIN_DATA`, `BaseConfig.TEST_DATA`, and `BaseConfig.SEAPHAGE_PROB` (these must be configured in `BaseConfig`).
  - Extracts `SEAPHAGES` samples with `label==1` and aligns them with the probability table via the `DNA_seqid` prefix to generate `confidence`.
  - Converts `FS_period` sequences to one-hot encoding, standardizes them, and uses the high-confidence samples as a reference library. It then calculates the average KNN distance of each medium/low-confidence sample and filters by a quantile threshold (see the sketch after this section).
  - Merges and annotates with `confidence_level` (high/medium/low), then saves to a specified CSV.
- Input/Output:
  - Input: training/testing CSVs and the `seaphage_prob` probability table.
  - Output: the filtered `seaphage_selected.csv` (note: the output path is currently hardcoded and should be managed by `BaseConfig`).
- Dependencies:
  - `pandas`, `numpy`, `scikit-learn` (`NearestNeighbors`, `StandardScaler`).
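An illustrative sketch of the KNN distance filter, assuming one-hot encoded and standardized matrices `X_high` (the high-confidence reference library) and `X_cand` (the medium/low-confidence candidates); `n_neighbors` and `quantile` are hypothetical defaults, not the repository's values:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_by_knn_distance(X_high, X_cand, n_neighbors=5, quantile=0.5):
    # Fit a neighbor index on the high-confidence reference library.
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X_high)
    # Average distance of each candidate to its nearest reference samples.
    dists, _ = nn.kneighbors(X_cand)
    avg_dist = dists.mean(axis=1)
    # Keep candidates whose average distance falls within the quantile
    # threshold, i.e. those close to the high-confidence region.
    threshold = np.quantile(avg_dist, quantile)
    return np.where(avg_dist <= threshold)[0]
```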
## train_models/
### train_models/hist_gb.py

- Purpose:
  - Uses `HistGradientBoostingClassifier` for frameshift-site classification, training, and evaluation. Supports constructing explicit features from `full_seq` and appending the MFE features.
- Logic and Features:
  - A central crop of `GBConfig.SEQUENCE_LENGTH=33` is used to maintain the reading frame, from which the following are constructed (see the sketch after this section):
    - Mononucleotide one-hot features (4×L).
    - Dinucleotide features (16×(L-1)).
    - Trinucleotide (codon) features (64×(L-2)).
    - Structural energy features: `mfe_40bp`, `mfe_120bp`.
- Key Functions:
  - `sequence_to_features()`, `prepare_data()`: generate the feature matrix and sample weights from `full_seq` and the MFE columns.
  - `train_hist_model()`: handles training, the validation split, and evaluation (on the test set and the external Xu/Atkins sets).
  - `analyze_feature_importance()`: exports the model's built-in feature importance to a CSV file.
- Input/Output:
  - Input: merged data from `BaseConfig.DATA_DIR` (`merged_train_data.csv`, `merged_test_data.csv`, etc.).
  - Output: the model object, evaluation metrics, and an importance CSV (saved to `BaseConfig.GB_DIR`).
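A minimal sketch of the explicit k-mer feature construction, assuming the 33 bp central crop described above; the function name matches the documented `sequence_to_features()`, but the signature and helpers here are illustrative:

```python
import numpy as np
from itertools import product

BASES = "ACGT"
# Index tables for all k-mers of length 1, 2, 3 (4, 16, and 64 entries).
KMER_INDEX = {
    k: {"".join(p): i for i, p in enumerate(product(BASES, repeat=k))}
    for k in (1, 2, 3)
}

def kmer_onehot(seq, k):
    # One-hot encode every length-k window: 4**k features per position.
    n_pos = len(seq) - k + 1
    feats = np.zeros((n_pos, 4 ** k))
    for i in range(n_pos):
        idx = KMER_INDEX[k].get(seq[i:i + k])
        if idx is not None:  # windows with ambiguous bases stay all-zero
            feats[i, idx] = 1.0
    return feats.ravel()

def sequence_to_features(full_seq, mfe_40bp, mfe_120bp, length=33):
    # A central crop of length 33 keeps the reading frame around the site.
    start = (len(full_seq) - length) // 2
    seq = full_seq[start:start + length].upper()
    return np.concatenate([
        kmer_onehot(seq, 1),    # mononucleotides: 4 x L
        kmer_onehot(seq, 2),    # dinucleotides:   16 x (L-1)
        kmer_onehot(seq, 3),    # codons:          64 x (L-2)
        [mfe_40bp, mfe_120bp],  # structural energy features
    ])
```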
### train_models/bilstm_cnn.py

- Purpose:
  - An end-to-end sequence classification model with a hybrid architecture: Input → Embedding → BiLSTM → Parallel CNN Branches → Concatenation → Dense Layers → Sigmoid Output (see the sketch after this section).
  - Supports self-training: iteratively selects pseudo-labeled samples from a pool of low-confidence samples and adds them to the training set.
- Logic:
  - Sequence encoding: `encode_sequence()` maps 'ATCG' to {1, 2, 3, 4} and pads/trims to `Config.Sequence_len=399`.
  - Training monitoring: `MetricsCallback` computes test-set metrics at each epoch to track the best-performing model.
  - Self-training loop: calls `utils.function.select_low_confidence_samples_cnn()` to select samples based on the model probability and a given `final_prob`.
- Key Functions:
  - `create_bilstm_cnn_model()`, `prepare_data()`, `train_bilstm_cnn_model()`.
  - `main()`: loads the data, trains the model, and saves the best and final models along with training information via `save_training_info()`.
- Input/Output:
  - Input: train/test sets returned by `load_data()`, plus optional external validation sets (Xu/Atkins).
  - Output: saves the `*.h5` model, `*_training_info.json`, and `*_weights.pkl` to `BaseConfig.BILSTM_MODEL_DIR`.
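A hedged Keras sketch of the described topology; the embedding width, LSTM units, kernel sizes, and dense sizes are assumptions, not the repository's actual hyperparameters:

```python
from tensorflow.keras import layers, models

def create_bilstm_cnn_model(seq_len=399, vocab_size=5):
    # Integer-encoded input: 0 is padding, 1-4 encode A/T/C/G.
    inp = layers.Input(shape=(seq_len,))
    x = layers.Embedding(input_dim=vocab_size, output_dim=32)(inp)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    # Parallel CNN branches with different kernel widths over the BiLSTM output.
    branches = []
    for kernel_size in (3, 5, 7):
        b = layers.Conv1D(64, kernel_size, padding="same", activation="relu")(x)
        b = layers.GlobalMaxPooling1D()(b)
        branches.append(b)
    x = layers.Concatenate()(branches)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```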
## model_feature/
### model_feature/feature_analysis.py

- Purpose:
  - Provides a unified interface for feature importance analysis:
    - GB model: Permutation Importance (`sklearn.inspection.permutation_importance`) and SHAP (see the sketch after this section).
    - BiLSTM-CNN model: Integrated Gradients and saliency maps (gradient-based).
- Logic:
  - Loads the trained models and validation sets, encodes the data according to each model's pipeline, calculates importance, and saves the results to separate files plus a summary JSON.
- Key Classes/Methods:
  - `FeatureImportanceAnalyzer`: encapsulates model/data loading, feature preparation, and the various importance methods.
  - `run_all_analyses()`: a single command that runs all analyses and saves results to `output_dir/{gb_model,bilstm_model,combined_analysis}`.
- Note:
  - The import paths are currently written as `from models.hist_gb ...` and `from models.bilstm_cnn ...`, but the actual files live in `train_models/`. This inconsistency must be fixed before running, either by correcting the paths or by creating a package with that name.
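A minimal sketch of the GB-side permutation step, assuming a fitted classifier `gb_model`, a validation split `X_val`/`y_val`, and a `feature_names` list (all illustrative names; the scoring metric is an assumption):

```python
import pandas as pd
from sklearn.inspection import permutation_importance

def gb_permutation_importance(gb_model, X_val, y_val, feature_names):
    # Shuffle each feature column in turn and measure the drop in score.
    result = permutation_importance(
        gb_model, X_val, y_val, n_repeats=10, random_state=0, scoring="roc_auc"
    )
    return (
        pd.DataFrame({
            "feature": feature_names,
            "importance_mean": result.importances_mean,
            "importance_std": result.importances_std,
        })
        .sort_values("importance_mean", ascending=False)
    )
```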
## utils/
### utils/config.py

- Purpose:
  - Centralizes the management of paths for data, models, and results, and provides `create_directories()` to ensure that the directories exist.
- Note:
  - Currently contains placeholder paths (e.g., `/path/to/...`). These must be adapted to the local environment before execution, including `DATA_DIR`, `TRAIN_DATA`, `RESULT_DIR`, etc.
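A sketch of the centralized configuration, using only attribute names that appear in this document; the directory layout and placeholder values are assumptions to be replaced per environment:

```python
from pathlib import Path

class BaseConfig:
    # Placeholder paths: replace with real locations before running.
    DATA_DIR = Path("/path/to/data")
    TRAIN_DATA = DATA_DIR / "merged_train_data.csv"
    TEST_DATA = DATA_DIR / "merged_test_data.csv"
    RESULT_DIR = Path("/path/to/results")
    GB_DIR = RESULT_DIR / "gb_model"
    BILSTM_MODEL_DIR = RESULT_DIR / "bilstm_model"

    @classmethod
    def create_directories(cls):
        # Ensure all input/output directories exist before training or analysis.
        for d in (cls.DATA_DIR, cls.RESULT_DIR, cls.GB_DIR, cls.BILSTM_MODEL_DIR):
            d.mkdir(parents=True, exist_ok=True)
```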
### utils/function.py

- Purpose: common utilities.
  - Self-training sample selection (two variants, for CNN and GB): selects pseudo-labeled samples based on the model probability, prediction entropy, and an external `final_prob` confidence threshold (see the sketch after this section).
  - Saving training results and models: `save_training_info()` saves the `.h5` model, a training-info JSON, and a weights pkl in one call.
  - Data loading: `load_data()` merges and validates the columns (`full_seq`, `label`, `source`) and downsamples `EUPLOTES` negative samples as needed.
  - Evaluation: `evaluate_model_gb()` and `evaluate_model_cnn()` compute the common metrics and log loss.
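An illustrative sketch of the pseudo-label selection rule combining the three documented criteria (model probability, entropy, external `final_prob`); the thresholds and function name are hypothetical:

```python
import numpy as np

def select_pseudo_labels(probs, final_prob, prob_thresh=0.9,
                         entropy_thresh=0.3, conf_thresh=0.5):
    probs = np.asarray(probs, dtype=float)
    final_prob = np.asarray(final_prob, dtype=float)
    # Binary prediction entropy: low entropy means a confident model.
    eps = 1e-12
    entropy = -(probs * np.log(probs + eps)
                + (1 - probs) * np.log(1 - probs + eps))
    confident = (probs >= prob_thresh) | (probs <= 1 - prob_thresh)
    externally_ok = final_prob >= conf_thresh
    keep = confident & (entropy <= entropy_thresh) & externally_ok
    pseudo_labels = (probs >= 0.5).astype(int)
    # Return the indices of accepted samples and their pseudo-labels.
    return np.where(keep)[0], pseudo_labels[keep]
```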
## Interactions & Flow

- Use `data_pre/mfe.py` to calculate the MFE columns and write them into the data CSVs.
- Use `data_pre/seaphage_knn.py` to filter and supplement training samples based on confidence and KNN distance.
- Training:
  - GB (`train_models/hist_gb.py`): constructs explicit features from `full_seq` + `mfe_*` for training and evaluation.
  - BiLSTM-CNN (`train_models/bilstm_cnn.py`): trains end-to-end on encoded sequences with the hybrid BiLSTM-CNN architecture, with support for iterative self-training.
- Analysis: `model_feature/feature_analysis.py` outputs feature/positional importance and a summary across the various methods.
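A hypothetical end-to-end driver tying the stages together. The module and function names follow the manifest above, but this wrapper script and the zero-argument call signatures are assumptions, not part of the repository:

```python
from utils.config import BaseConfig
from data_pre.mfe import calculate_mfe_features
from train_models.hist_gb import train_hist_model
from train_models.bilstm_cnn import main as train_bilstm_cnn

BaseConfig.create_directories()

# 1. Write mfe_40bp / mfe_120bp back into the training CSV.
calculate_mfe_features(BaseConfig.TRAIN_DATA)

# 2. Run data_pre/seaphage_knn.py separately to build the augmented set.

# 3. Train both models.
train_hist_model()    # GB on explicit k-mer + MFE features
train_bilstm_cnn()    # end-to-end BiLSTM-CNN with self-training

# 4. Run model_feature/feature_analysis.py (run_all_analyses) for importance.
```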