FScanpy Commit Code Architecture

Code Manifest

  • data_pre/
    • mfe.py
    • seaphage_knn.py
  • model_feature/
    • feature_analysis.py
  • train_models/
    • bilstm_cnn.py
    • hist_gb.py
  • utils/
    • config.py
    • function.py

Overview & Pipeline

This repository is organized around the main pipeline: "Read and preprocess sequence data → Construct features and train models → Analyze importance and save results."

  1. Data Preprocessing (data_pre/)
    • Calculate and write back structural energy features (MFE), and filter samples based on external confidence scores and KNN distance.
  2. Model Training (train_models/)
    • The traditional Gradient Boosting model (HistGradientBoosting) uses explicit features (mono/di/tri-nucleotides + MFE).
    • The deep learning model (BiLSTM-CNN) learns representations end-to-end from encoded sequences and supports iterative self-training.
  3. Feature Importance Analysis (model_feature/)
    • For the GB model, perform Permutation Importance and SHAP analysis.
    • For the BiLSTM-CNN model, perform Integrated Gradients and Saliency Map analysis.
  4. Common Configuration and Utilities (utils/)
    • Contains general-purpose functions for path/directory configuration, data loading, evaluation, and results saving.

data_pre/

data_pre/mfe.py

  • Purpose:
    • Calls ViennaRNA (import RNA) to calculate the Minimum Free Energy (MFE) for given sequence windows and writes the results back to a CSV file.
  • Logic:
    • For each full_seq, it crops subsequences based on configuration (default start: 198, window lengths: 40 and 120).
    • Uses RNA.fold_compound(sub_seq).mfe() to calculate the MFE and populates the corresponding columns (mfe_40bp, mfe_120bp); a minimal sketch follows at the end of this section.
  • Key Functions:
    • predict_rna_structure_multiple(csv_file, predictions): Writes back columns for multiple window configurations.
    • calculate_mfe_features(data_file, output_file=None): The standard entry point that assembles predictions and calls the above function.
  • Input/Output:
    • Input: A CSV file containing full_seq (e.g., BaseConfig.VALIDATION_DATA).
    • Output: Appends two columns, mfe_40bp and mfe_120bp, to the original CSV and saves it.
  • Dependencies:
    • ViennaRNA Python interface (RNA).
    • pandas.
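
A minimal sketch of the window cropping and MFE write-back described above, assuming a CSV with a full_seq column; the helper name add_mfe_columns and its defaults are illustrative, not the module's actual API:

```python
# Hedged sketch: crop fixed windows from each full_seq, fold them with
# ViennaRNA, and write the energies back to the same CSV.
import pandas as pd
import RNA  # ViennaRNA Python interface

def add_mfe_columns(csv_file, start=198, window_lengths=(40, 120)):
    df = pd.read_csv(csv_file)
    for length in window_lengths:
        energies = []
        for seq in df["full_seq"]:
            sub_seq = seq[start:start + length]
            _, mfe = RNA.fold_compound(sub_seq).mfe()  # returns (structure, energy)
            energies.append(mfe)
        df[f"mfe_{length}bp"] = energies               # e.g. mfe_40bp, mfe_120bp
    df.to_csv(csv_file, index=False)
```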

data_pre/seaphage_knn.py

  • Purpose:
    • Selects medium/low-confidence samples based on externally provided confidence rankings for the SEAPHAGES subset (final_rankconfidence) and on KNN distances between one-hot encoded sequences. The selected samples are then merged with high-confidence samples to create an augmented training set.
  • Logic:
    • Reads BaseConfig.TRAIN_DATA, BaseConfig.TEST_DATA, and BaseConfig.SEAPHAGE_PROB (requires configuration in BaseConfig).
    • Extracts samples from SEAPHAGES with label==1 and aligns them with the probability table via the DNA_seqid prefix to derive a confidence value.
    • Converts FS_period sequences to one-hot encoding, standardizes them, and uses high-confidence samples as a reference library. It then calculates the average KNN distance for each medium/low-confidence sample and filters against a quantile threshold (sketched at the end of this section).
    • Merges and annotates with confidence_level (high/medium/low), then saves to a specified CSV.
  • Input/Output:
    • Input: Training/testing CSVs and the seaphage_prob probability table.
    • Output: The filtered seaphage_selected.csv (Note: the output path is currently hardcoded and should be managed by BaseConfig).
  • Dependencies:
    • pandas, numpy, scikit-learn (NearestNeighbors, StandardScaler).
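
The KNN filtering idea, as a hedged sketch (helper names and thresholds are illustrative; the real module reads its inputs via BaseConfig and assumes fixed-length sequences):

```python
# Sketch: one-hot encode fixed-length sequences, standardize, use the
# high-confidence samples as a reference library, and keep medium/low-
# confidence samples whose average KNN distance falls under a quantile cut.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def one_hot(seq, alphabet="ACGT"):
    idx = {b: i for i, b in enumerate(alphabet)}
    vec = np.zeros((len(seq), len(alphabet)))
    for pos, base in enumerate(seq):
        if base in idx:
            vec[pos, idx[base]] = 1.0
    return vec.ravel()

def select_by_knn(high_seqs, low_seqs, k=5, quantile=0.5):
    scaler = StandardScaler()
    X_high = scaler.fit_transform(np.array([one_hot(s) for s in high_seqs]))
    X_low = scaler.transform(np.array([one_hot(s) for s in low_seqs]))
    nn = NearestNeighbors(n_neighbors=k).fit(X_high)    # reference library
    dist, _ = nn.kneighbors(X_low)
    avg_dist = dist.mean(axis=1)                        # average KNN distance
    return avg_dist <= np.quantile(avg_dist, quantile)  # boolean keep-mask
```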

train_models/

train_models/hist_gb.py

  • Purpose:
    • Trains and evaluates a HistGradientBoostingClassifier for frameshift-site classification. It supports constructing explicit features from full_seq and adding MFE features.
  • Logic and Features:
    • A centered crop of GBConfig.SEQUENCE_LENGTH=33 preserves the reading frame; from it the following features are constructed (see the sketch at the end of this section):
      • Mononucleotide one-hot features (4×L).
      • Dinucleotide features (16×(L-1)).
      • Trinucleotide (codon) features (64×(L-2)).
      • Structural energy features: mfe_40bp, mfe_120bp.
  • Key Functions:
    • sequence_to_features(), prepare_data(): Generate the feature matrix and weights from full_seq and MFE columns.
    • train_hist_model(): Handles training, validation split, and evaluation (on test set and external Xu/Atkins sets).
    • analyze_feature_importance(): Exports built-in feature importance to a CSV file.
  • Input/Output:
    • Input: Merged data from BaseConfig.DATA_DIR (merged_train_data.csv, merged_test_data.csv, etc.).
    • Output: Model object, evaluation metrics, and an importance CSV (saved to BaseConfig.GB_DIR).
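
A hedged sketch of the explicit feature construction; the real sequence_to_features() may order or crop differently:

```python
# Positional k-mer one-hot features for k = 1, 2, 3 plus the two MFE columns,
# built from a centered crop (assumed here to be frame-aligned).
import numpy as np

BASES = "ACGT"
BASE_INDEX = {b: i for i, b in enumerate(BASES)}

def kmer_onehot(seq, k):
    # one-hot over all positions: (4**k) * (len(seq) - k + 1) features
    feats = np.zeros((len(seq) - k + 1, 4 ** k))
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        if all(b in BASE_INDEX for b in kmer):
            code = 0
            for b in kmer:
                code = code * 4 + BASE_INDEX[b]
            feats[pos, code] = 1.0
    return feats.ravel()

def sequence_to_features(full_seq, mfe_40bp, mfe_120bp, seq_len=33):
    start = (len(full_seq) - seq_len) // 2    # centered crop
    seq = full_seq[start:start + seq_len]
    return np.concatenate([
        kmer_onehot(seq, 1),                  # 4 * L mononucleotide
        kmer_onehot(seq, 2),                  # 16 * (L-1) dinucleotide
        kmer_onehot(seq, 3),                  # 64 * (L-2) trinucleotide (codon)
        np.array([mfe_40bp, mfe_120bp]),      # structural energy features
    ])
```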

train_models/bilstm_cnn.py

  • Purpose:
    • An end-to-end sequence classification model with a hybrid architecture. The model processes sequences through the following layers: Input → Embedding → BiLSTM → Parallel CNN Branches → Concatenation → Dense Layers → Sigmoid Output (sketched at the end of this section).
    • Supports self-training: iteratively selects pseudo-labeled samples from a pool of low-confidence samples to add to the training set.
  • Logic:
    • Sequence Encoding: encode_sequence() converts 'ATCG' to {1,2,3,4} and pads/trims to Config.Sequence_len=399.
    • Training Monitoring: MetricsCallback calculates test set metrics at each epoch to track the best-performing model.
    • Self-training Loop: Calls utils.function.select_low_confidence_samples_cnn() to select samples based on model probability and a given final_prob.
  • Key Functions:
    • create_bilstm_cnn_model(), prepare_data(), train_bilstm_cnn_model().
    • main(): Loads data, trains the model, and saves the best and final models along with training information using save_training_info().
  • Input/Output:
    • Input: Train/test sets returned by load_data(), and optional external validation sets (Xu/Atkins).
    • Output: Saves *.h5 model, *_training_info.json, and *_weights.pkl to BaseConfig.BILSTM_MODEL_DIR.
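
A hedged Keras sketch of the encoding and the hybrid architecture; embedding size, LSTM units, and kernel widths are placeholders rather than the trained configuration:

```python
import numpy as np
from tensorflow.keras import layers, Model

SEQ_LEN = 399  # Config.Sequence_len

def encode_sequence(seq, seq_len=SEQ_LEN):
    mapping = {"A": 1, "T": 2, "C": 3, "G": 4}   # 0 is reserved for padding
    enc = [mapping.get(base, 0) for base in seq[:seq_len]]
    return np.array(enc + [0] * (seq_len - len(enc)))

def create_bilstm_cnn_model(seq_len=SEQ_LEN, kernel_sizes=(3, 5, 7)):
    inp = layers.Input(shape=(seq_len,))
    x = layers.Embedding(input_dim=5, output_dim=32)(inp)        # tokens 0..4
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    branches = [
        layers.GlobalMaxPooling1D()(layers.Conv1D(64, k, activation="relu")(x))
        for k in kernel_sizes                    # parallel CNN branches
    ]
    x = layers.Concatenate()(branches)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```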

model_feature/

model_feature/feature_analysis.py

  • Purpose:
    • Provides a unified interface for feature importance analysis:
      • GB Model: Permutation Importance (sklearn.inspection.permutation_importance) and SHAP; a minimal permutation-importance sketch follows at the end of this section.
      • BiLSTM-CNN Model: Integrated Gradients and Saliency Maps (gradient-based).
  • Logic:
    • Reads trained models and validation sets, encodes data according to each model's pipeline, calculates importance, and saves results to separate files and a summary JSON.
  • Key Classes/Methods:
    • FeatureImportanceAnalyzer: A class that encapsulates model/data loading, feature preparation, and various importance methods.
    • run_all_analyses(): A single command to run all analyses and save results to output_dir/{gb_model,bilstm_model,combined_analysis}.
  • Note:
    • Import paths are currently written as from models.hist_gb ... and from models.bilstm_cnn ..., but the actual files are in train_models/. This inconsistency needs to be fixed before running, either by correcting the paths or creating a package with that name.
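
For the GB side, the permutation-importance step reduces to a standard sklearn call; a minimal sketch (the analyzer class additionally wraps SHAP and the gradient-based methods):

```python
from sklearn.inspection import permutation_importance

def gb_permutation_importance(model, X_val, y_val, feature_names, n_repeats=10):
    result = permutation_importance(model, X_val, y_val,
                                    n_repeats=n_repeats, random_state=0)
    # rank features by the mean drop in score when each one is permuted
    return sorted(zip(feature_names, result.importances_mean),
                  key=lambda pair: pair[1], reverse=True)
```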

utils/

utils/config.py

  • Purpose:
    • Centralizes the management of paths for data, models, and results. Provides create_directories() to ensure that directories exist.
  • Note:
    • Currently contains placeholder paths (e.g., /path/to/...). These must be adapted to the local environment before execution, including DATA_DIR, TRAIN_DATA, RESULT_DIR, etc. (a hypothetical layout is sketched below).
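
A hypothetical shape of the module, with the placeholder paths called out above (all values and the exact attribute set are assumptions):

```python
# Hedged sketch of utils/config.py; every path below is a placeholder.
import os

class BaseConfig:
    DATA_DIR = "/path/to/data"
    TRAIN_DATA = os.path.join(DATA_DIR, "merged_train_data.csv")
    TEST_DATA = os.path.join(DATA_DIR, "merged_test_data.csv")
    RESULT_DIR = "/path/to/results"
    GB_DIR = os.path.join(RESULT_DIR, "gb")
    BILSTM_MODEL_DIR = os.path.join(RESULT_DIR, "bilstm")

def create_directories():
    # ensure every output directory exists before training/analysis runs
    for d in (BaseConfig.DATA_DIR, BaseConfig.RESULT_DIR,
              BaseConfig.GB_DIR, BaseConfig.BILSTM_MODEL_DIR):
        os.makedirs(d, exist_ok=True)
```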

utils/function.py

  • Purpose:
    • Common Utilities:
      • Self-training sample selection (separate variants for the CNN and GB models): selects pseudo-labeled samples based on model probability, entropy, and an external final_prob confidence threshold (sketched below).
      • Saving of training results and models: save_training_info() saves the .h5 model, a training-info JSON, and a weights .pkl in one call.
      • Data Loading: load_data() merges and validates columns (full_seq, label, source) and downsamples EUPLOTES negative samples as needed.
      • Evaluation: evaluate_model_gb() and evaluate_model_cnn() calculate common metrics and logloss.
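
A hedged sketch of the pseudo-label selection idea behind the CNN/GB variants (thresholds and the function name are illustrative; see select_low_confidence_samples_cnn() for the real logic):

```python
import numpy as np

def select_pseudo_labels(probs, final_prob, ent_thresh=0.3, conf_thresh=0.5):
    probs = np.asarray(probs, dtype=float)
    # binary prediction entropy: low entropy = decisive model output
    entropy = -(probs * np.log(probs + 1e-12)
                + (1 - probs) * np.log(1 - probs + 1e-12))
    # keep samples the model is sure about AND that external confidence trusts
    keep = (entropy <= ent_thresh) & (np.asarray(final_prob) >= conf_thresh)
    labels = (probs >= 0.5).astype(int)       # pseudo-labels for kept samples
    return keep, labels
```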

Interactions & Flow

  1. Use data_pre/mfe.py to calculate and write MFE columns to the data CSVs.
  2. Use data_pre/seaphage_knn.py to filter and supplement training samples based on confidence and KNN.
  3. Training:
    • GB (train_models/hist_gb.py): Construct explicit features from full_seq + mfe_* for training and evaluation.
    • BiLSTM-CNN (train_models/bilstm_cnn.py): End-to-end training on encoded sequences using a hybrid BiLSTM-CNN architecture, with support for iterative self-training.
  4. Analysis: model_feature/feature_analysis.py outputs feature/positional importance and a summary from various methods (a driver sketch follows).
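
A hypothetical end-to-end driver tying these steps together; the argument lists of the training entry points are illustrative (see the per-module sections above for the documented names):

```python
from utils.config import BaseConfig, create_directories
from data_pre.mfe import calculate_mfe_features
from train_models import hist_gb, bilstm_cnn

create_directories()                           # ensure output dirs exist
calculate_mfe_features(BaseConfig.TRAIN_DATA)  # step 1: write mfe_* columns
# step 2: run data_pre/seaphage_knn.py to build the augmented training set
model = hist_gb.train_hist_model()             # step 3: GB training (args assumed)
bilstm_cnn.main()                              # step 3: BiLSTM-CNN training
# step 4: run model_feature/feature_analysis.py (run_all_analyses) for importance
```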