FScanpy Commit Code Architecture

Code Manifest

  • data_pre/
    • mfe.py
    • seaphage_knn.py
  • model_feature/
    • feature_analysis.py
  • train_models/
    • bilstm_cnn.py
    • hist_gb.py
  • utils/
    • config.py
    • function.py

Overview & Pipeline

This repository is organized around the main pipeline: "Read and preprocess sequence data → Construct features and train models → Analyze importance and save results."

  1. Data Preprocessing (data_pre/)
    • Calculate and write back structural energy features (MFE), and filter samples based on external confidence scores and KNN distance.
  2. Model Training (train_models/)
    • The traditional Gradient Boosting model (HistGradientBoosting) uses explicit features (mono/di/tri-nucleotides + MFE).
    • The deep learning model (BiLSTM-CNN) learns representations end-to-end from encoded sequences and supports iterative self-training.
  3. Feature Importance Analysis (model_feature/)
    • For the GB model, perform Permutation Importance and SHAP analysis.
    • For the BiLSTM-CNN model, perform Integrated Gradients and Saliency Map analysis.
  4. Common Configuration and Utilities (utils/)
    • Contains general-purpose functions for path/directory configuration, data loading, evaluation, and results saving.

data_pre/

data_pre/mfe.py

  • Purpose:
    • Calls ViennaRNA (import RNA) to calculate the Minimum Free Energy (MFE) for given sequence windows and writes the results back to a CSV file.
  • Logic:
    • For each full_seq, it crops subsequences based on configuration (default start: 198, window lengths: 40 and 120).
    • Uses RNA.fold_compound(sub_seq).mfe() to calculate the MFE and populates the corresponding columns (mfe_40bp, mfe_120bp); a minimal sketch follows at the end of this section.
  • Key Functions:
    • predict_rna_structure_multiple(csv_file, predictions): Writes back columns for multiple window configurations.
    • calculate_mfe_features(data_file, output_file=None): The standard entry point that assembles predictions and calls the above function.
  • Input/Output:
    • Input: A CSV file containing full_seq (e.g., BaseConfig.VALIDATION_DATA).
    • Output: Appends two columns, mfe_40bp and mfe_120bp, to the original CSV and saves it.
  • Dependencies:
    • ViennaRNA Python interface (RNA).
    • pandas.
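
A minimal sketch of the window cropping and MFE write-back described above, assuming a CSV with a full_seq column; the helper name add_mfe_columns and its defaults are illustrative, not the module's actual API:

```python
# Hedged sketch: crop fixed windows from each full_seq, fold them with
# ViennaRNA, and write the energies back to the same CSV.
import pandas as pd
import RNA  # ViennaRNA Python interface

def add_mfe_columns(csv_file, start=198, window_lengths=(40, 120)):
    df = pd.read_csv(csv_file)
    for length in window_lengths:
        energies = []
        for seq in df["full_seq"]:
            sub_seq = seq[start:start + length]
            _, mfe = RNA.fold_compound(sub_seq).mfe()  # returns (structure, energy)
            energies.append(mfe)
        df[f"mfe_{length}bp"] = energies               # e.g. mfe_40bp, mfe_120bp
    df.to_csv(csv_file, index=False)
```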

data_pre/seaphage_knn.py

  • Purpose:
    • Selects medium/low-confidence samples based on externally provided confidence rankings for the SEAPHAGES subset (final_rankconfidence) and on KNN distances between one-hot encoded sequences. The selected samples are then merged with high-confidence samples to create an augmented training set.
  • Logic:
    • Reads BaseConfig.TRAIN_DATA, BaseConfig.TEST_DATA, and BaseConfig.SEAPHAGE_PROB (requires configuration in BaseConfig).
    • Extracts samples from SEAPHAGES with label==1 and aligns them with the probability table via the DNA_seqid prefix to derive a confidence value.
    • Converts FS_period sequences to one-hot encoding, standardizes them, and uses high-confidence samples as a reference library. It then calculates the average KNN distance for each medium/low-confidence sample and filters against a quantile threshold (sketched at the end of this section).
    • Merges and annotates with confidence_level (high/medium/low), then saves to a specified CSV.
  • Input/Output:
    • Input: Training/testing CSVs and the seaphage_prob probability table.
    • Output: The filtered seaphage_selected.csv (Note: the output path is currently hardcoded and should be managed by BaseConfig).
  • Dependencies:
    • pandas, numpy, scikit-learn (NearestNeighbors, StandardScaler).
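
The KNN filtering idea, as a hedged sketch (helper names and thresholds are illustrative; the real module reads its inputs via BaseConfig and assumes fixed-length sequences):

```python
# Sketch: one-hot encode fixed-length sequences, standardize, use the
# high-confidence samples as a reference library, and keep medium/low-
# confidence samples whose average KNN distance falls under a quantile cut.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def one_hot(seq, alphabet="ACGT"):
    idx = {b: i for i, b in enumerate(alphabet)}
    vec = np.zeros((len(seq), len(alphabet)))
    for pos, base in enumerate(seq):
        if base in idx:
            vec[pos, idx[base]] = 1.0
    return vec.ravel()

def select_by_knn(high_seqs, low_seqs, k=5, quantile=0.5):
    scaler = StandardScaler()
    X_high = scaler.fit_transform(np.array([one_hot(s) for s in high_seqs]))
    X_low = scaler.transform(np.array([one_hot(s) for s in low_seqs]))
    nn = NearestNeighbors(n_neighbors=k).fit(X_high)    # reference library
    dist, _ = nn.kneighbors(X_low)
    avg_dist = dist.mean(axis=1)                        # average KNN distance
    return avg_dist <= np.quantile(avg_dist, quantile)  # boolean keep-mask
```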

train_models/

train_models/hist_gb.py

  • Purpose:
    • Trains and evaluates a HistGradientBoostingClassifier for frameshift-site classification. It supports constructing explicit features from full_seq and adding MFE features.
  • Logic and Features:
    • A centered crop of GBConfig.SEQUENCE_LENGTH=33 preserves the reading frame; from it the following features are constructed (see the sketch at the end of this section):
      • Mononucleotide one-hot features (4×L).
      • Dinucleotide features (16×(L-1)).
      • Trinucleotide (codon) features (64×(L-2)).
      • Structural energy features: mfe_40bp, mfe_120bp.
  • Key Functions:
    • sequence_to_features(), prepare_data(): Generate the feature matrix and weights from full_seq and MFE columns.
    • train_hist_model(): Handles training, validation split, and evaluation (on test set and external Xu/Atkins sets).
    • analyze_feature_importance(): Exports built-in feature importance to a CSV file.
  • Input/Output:
    • Input: Merged data from BaseConfig.DATA_DIR (merged_train_data.csv, merged_test_data.csv, etc.).
    • Output: Model object, evaluation metrics, and an importance CSV (saved to BaseConfig.GB_DIR).
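
A hedged sketch of the explicit feature construction; the real sequence_to_features() may order or crop differently:

```python
# Positional k-mer one-hot features for k = 1, 2, 3 plus the two MFE columns,
# built from a centered crop (assumed here to be frame-aligned).
import numpy as np

BASES = "ACGT"
BASE_INDEX = {b: i for i, b in enumerate(BASES)}

def kmer_onehot(seq, k):
    # one-hot over all positions: (4**k) * (len(seq) - k + 1) features
    feats = np.zeros((len(seq) - k + 1, 4 ** k))
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        if all(b in BASE_INDEX for b in kmer):
            code = 0
            for b in kmer:
                code = code * 4 + BASE_INDEX[b]
            feats[pos, code] = 1.0
    return feats.ravel()

def sequence_to_features(full_seq, mfe_40bp, mfe_120bp, seq_len=33):
    start = (len(full_seq) - seq_len) // 2    # centered crop
    seq = full_seq[start:start + seq_len]
    return np.concatenate([
        kmer_onehot(seq, 1),                  # 4 * L mononucleotide
        kmer_onehot(seq, 2),                  # 16 * (L-1) dinucleotide
        kmer_onehot(seq, 3),                  # 64 * (L-2) trinucleotide (codon)
        np.array([mfe_40bp, mfe_120bp]),      # structural energy features
    ])
```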

train_models/bilstm_cnn.py

  • Purpose:
    • An end-to-end sequence classification model with a hybrid architecture. The model processes sequences through the following layers: Input → Embedding → BiLSTM → Parallel CNN Branches → Concatenation → Dense Layers → Sigmoid Output (sketched at the end of this section).
    • Supports self-training: iteratively selects pseudo-labeled samples from a pool of low-confidence samples to add to the training set.
  • Logic:
    • Sequence Encoding: encode_sequence() converts 'ATCG' to {1,2,3,4} and pads/trims to Config.Sequence_len=399.
    • Training Monitoring: MetricsCallback calculates test set metrics at each epoch to track the best-performing model.
    • Self-training Loop: Calls utils.function.select_low_confidence_samples_cnn() to select samples based on model probability and a given final_prob.
  • Key Functions:
    • create_bilstm_cnn_model(), prepare_data(), train_bilstm_cnn_model().
    • main(): Loads data, trains the model, and saves the best and final models along with training information using save_training_info().
  • Input/Output:
    • Input: Train/test sets returned by load_data(), and optional external validation sets (Xu/Atkins).
    • Output: Saves *.h5 model, *_training_info.json, and *_weights.pkl to BaseConfig.BILSTM_MODEL_DIR.
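
A hedged Keras sketch of the encoding and the hybrid architecture; embedding size, LSTM units, and kernel widths are placeholders rather than the trained configuration:

```python
import numpy as np
from tensorflow.keras import layers, Model

SEQ_LEN = 399  # Config.Sequence_len

def encode_sequence(seq, seq_len=SEQ_LEN):
    mapping = {"A": 1, "T": 2, "C": 3, "G": 4}   # 0 is reserved for padding
    enc = [mapping.get(base, 0) for base in seq[:seq_len]]
    return np.array(enc + [0] * (seq_len - len(enc)))

def create_bilstm_cnn_model(seq_len=SEQ_LEN, kernel_sizes=(3, 5, 7)):
    inp = layers.Input(shape=(seq_len,))
    x = layers.Embedding(input_dim=5, output_dim=32)(inp)        # tokens 0..4
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    branches = [
        layers.GlobalMaxPooling1D()(layers.Conv1D(64, k, activation="relu")(x))
        for k in kernel_sizes                    # parallel CNN branches
    ]
    x = layers.Concatenate()(branches)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```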

model_feature/

model_feature/feature_analysis.py

  • Purpose:
    • Provides a unified interface for feature importance analysis:
      • GB Model: Permutation Importance (sklearn.inspection.permutation_importance) and SHAP; a minimal permutation-importance sketch follows at the end of this section.
      • BiLSTM-CNN Model: Integrated Gradients and Saliency Maps (gradient-based).
  • Logic:
    • Reads trained models and validation sets, encodes data according to each model's pipeline, calculates importance, and saves results to separate files and a summary JSON.
  • Key Classes/Methods:
    • FeatureImportanceAnalyzer: A class that encapsulates model/data loading, feature preparation, and various importance methods.
    • run_all_analyses(): A single command to run all analyses and save results to output_dir/{gb_model,bilstm_model,combined_analysis}.
  • Note:
    • Import paths are currently written as from models.hist_gb ... and from models.bilstm_cnn ..., but the actual files are in train_models/. This inconsistency needs to be fixed before running, either by correcting the paths or creating a package with that name.
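
For the GB side, the permutation-importance step reduces to a standard sklearn call; a minimal sketch (the analyzer class additionally wraps SHAP and the gradient-based methods):

```python
from sklearn.inspection import permutation_importance

def gb_permutation_importance(model, X_val, y_val, feature_names, n_repeats=10):
    result = permutation_importance(model, X_val, y_val,
                                    n_repeats=n_repeats, random_state=0)
    # rank features by the mean drop in score when each one is permuted
    return sorted(zip(feature_names, result.importances_mean),
                  key=lambda pair: pair[1], reverse=True)
```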

utils/

utils/config.py

  • Purpose:
    • Centralizes the management of paths for data, models, and results. Provides create_directories() to ensure that directories exist.
  • Note:
    • Currently contains placeholder paths (e.g., /path/to/...). These must be adapted to the local environment before execution, including DATA_DIR, TRAIN_DATA, RESULT_DIR, etc. (a hypothetical layout is sketched below).
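
A hypothetical shape of the module, with the placeholder paths called out above (all values and the exact attribute set are assumptions):

```python
# Hedged sketch of utils/config.py; every path below is a placeholder.
import os

class BaseConfig:
    DATA_DIR = "/path/to/data"
    TRAIN_DATA = os.path.join(DATA_DIR, "merged_train_data.csv")
    TEST_DATA = os.path.join(DATA_DIR, "merged_test_data.csv")
    RESULT_DIR = "/path/to/results"
    GB_DIR = os.path.join(RESULT_DIR, "gb")
    BILSTM_MODEL_DIR = os.path.join(RESULT_DIR, "bilstm")

def create_directories():
    # ensure every output directory exists before training/analysis runs
    for d in (BaseConfig.DATA_DIR, BaseConfig.RESULT_DIR,
              BaseConfig.GB_DIR, BaseConfig.BILSTM_MODEL_DIR):
        os.makedirs(d, exist_ok=True)
```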

utils/function.py

  • Purpose:
    • Common Utilities:
      • Self-training sample selection (separate variants for the CNN and GB models): selects pseudo-labeled samples based on model probability, entropy, and an external final_prob confidence threshold (sketched below).
      • Saving of training results and models: save_training_info() saves the .h5 model, a training-info JSON, and a weights .pkl in one call.
      • Data Loading: load_data() merges and validates columns (full_seq, label, source) and downsamples EUPLOTES negative samples as needed.
      • Evaluation: evaluate_model_gb() and evaluate_model_cnn() calculate common metrics and logloss.
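
A hedged sketch of the pseudo-label selection idea behind the CNN/GB variants (thresholds and the function name are illustrative; see select_low_confidence_samples_cnn() for the real logic):

```python
import numpy as np

def select_pseudo_labels(probs, final_prob, ent_thresh=0.3, conf_thresh=0.5):
    probs = np.asarray(probs, dtype=float)
    # binary prediction entropy: low entropy = decisive model output
    entropy = -(probs * np.log(probs + 1e-12)
                + (1 - probs) * np.log(1 - probs + 1e-12))
    # keep samples the model is sure about AND that external confidence trusts
    keep = (entropy <= ent_thresh) & (np.asarray(final_prob) >= conf_thresh)
    labels = (probs >= 0.5).astype(int)       # pseudo-labels for kept samples
    return keep, labels
```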

Interactions & Flow

  1. Use data_pre/mfe.py to calculate and write MFE columns to the data CSVs.
  2. Use data_pre/seaphage_knn.py to filter and supplement training samples based on confidence and KNN.
  3. Training:
    • GB (train_models/hist_gb.py): Construct explicit features from full_seq + mfe_* for training and evaluation.
    • BiLSTM-CNN (train_models/bilstm_cnn.py): End-to-end training on encoded sequences using a hybrid BiLSTM-CNN architecture, with support for iterative self-training.
  4. Analysis: model_feature/feature_analysis.py outputs feature/positional importance and a summary from various methods (a driver sketch follows).
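
A hypothetical end-to-end driver tying these steps together; the argument lists of the training entry points are illustrative (see the per-module sections above for the documented names):

```python
from utils.config import BaseConfig, create_directories
from data_pre.mfe import calculate_mfe_features
from train_models import hist_gb, bilstm_cnn

create_directories()                           # ensure output dirs exist
calculate_mfe_features(BaseConfig.TRAIN_DATA)  # step 1: write mfe_* columns
# step 2: run data_pre/seaphage_knn.py to build the augmented training set
model = hist_gb.train_hist_model()             # step 3: GB training (args assumed)
bilstm_cnn.main()                              # step 3: BiLSTM-CNN training
# step 4: run model_feature/feature_analysis.py (run_all_analyses) for importance
```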