# FScanpy Commit Code Architecture

## Code Manifest

- `data_pre/`
  - `mfe.py`
  - `seaphage_knn.py`
- `model_feature/`
  - `feature_analysis.py`
- `train_models/`
  - `bilstm_cnn.py`
  - `hist_gb.py`
- `utils/`
  - `config.py`
  - `function.py`

## Overview & Pipeline

This repository is organized around one main pipeline: read and preprocess sequence data → construct features and train models → analyze feature importance and save results.

1.  **Data Preprocessing (`data_pre/`)**
    -   Calculates structural energy features (MFE) and writes them back to the data files; filters samples based on external confidence scores and KNN distance.
2.  **Model Training (`train_models/`)**
    -   The traditional Gradient Boosting model (HistGradientBoosting) uses explicit features (mono/di/tri-nucleotides + MFE).
    -   The deep learning model (BiLSTM-CNN) learns representations end-to-end from encoded sequences and supports iterative self-training.
3.  **Feature Importance Analysis (`model_feature/`)**
    -   For the GB model, performs Permutation Importance and SHAP analysis.
    -   For the BiLSTM-CNN model, performs Integrated Gradients and Saliency Map analysis.
4.  **Common Configuration and Utilities (`utils/`)**
    -   Contains general-purpose functions for path/directory configuration, data loading, evaluation, and saving results.

---

## `data_pre/`

### `data_pre/mfe.py`

*   **Purpose**:
    -   Calls ViennaRNA (`import RNA`) to calculate the Minimum Free Energy (MFE) for given sequence windows and writes the results back to a CSV file.
*   **Logic**:
    -   For each `full_seq`, crops subsequences according to the configuration (default start: 198; lengths: 40 and 120).
    -   Uses `RNA.fold_compound(sub_seq).mfe()`, which returns the predicted structure together with its energy, and stores the energy in the specified columns (`mfe_40bp`, `mfe_120bp`).
*   **Key Functions**:
    -   `predict_rna_structure_multiple(csv_file, predictions)`: Writes back columns for multiple window configurations.
    -   `calculate_mfe_features(data_file, output_file=None)`: The standard entry point; assembles the window configurations (`predictions`) and calls the function above.
*   **Input/Output**:
    -   **Input**: A CSV file containing `full_seq` (e.g., `BaseConfig.VALIDATION_DATA`).
    -   **Output**: Appends two columns, `mfe_40bp` and `mfe_120bp`, to the original CSV and saves it.
*   **Dependencies**:
    -   ViennaRNA Python interface (`RNA`).
    -   `pandas`.
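
The windowing logic above can be sketched roughly as follows. This is a minimal illustration, not the code in `mfe.py`: the helper names `vienna_mfe` and `mfe_features` are hypothetical, the start value 198 is assumed to be a 0-based index, and returning `None` for windows that do not fit is an assumption about how short sequences might be handled. The folding function is injectable so the windowing can be tried without ViennaRNA installed.

```python
from typing import Callable, Dict, Optional

# Assumed defaults taken from the description above: windows start at
# index 198 (treated here as 0-based) and span 40 and 120 nucleotides.
DEFAULT_START = 198
DEFAULT_WINDOWS = {"mfe_40bp": 40, "mfe_120bp": 120}


def vienna_mfe(sub_seq: str) -> float:
    """Fold one subsequence with ViennaRNA and return its MFE (kcal/mol)."""
    import RNA  # imported lazily so the helpers below work without ViennaRNA

    _structure, energy = RNA.fold_compound(sub_seq).mfe()
    return energy


def mfe_features(
    full_seq: str,
    fold: Callable[[str], float] = vienna_mfe,
    start: int = DEFAULT_START,
    windows: Optional[Dict[str, int]] = None,
) -> Dict[str, Optional[float]]:
    """Return one MFE value per window column, or None if the window does not fit."""
    windows = DEFAULT_WINDOWS if windows is None else windows
    features: Dict[str, Optional[float]] = {}
    for column, length in windows.items():
        sub_seq = full_seq[start:start + length]
        # A short sequence cannot supply the full window; leave the value empty.
        features[column] = fold(sub_seq) if len(sub_seq) == length else None
    return features
```

Applied row by row to the `full_seq` column, the returned dictionaries map directly onto the `mfe_40bp` and `mfe_120bp` columns described above.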

### `data_pre/seaphage_knn.py`

*   **Purpose**:
    -   Augments the training set with medium- and low-confidence `SEAPHAGES` samples. Candidates are chosen using externally provided confidence rankings (`final_rank` → `confidence`) and their distance to high-confidence samples in one-hot sequence space; the selected samples are then merged with the high-confidence samples.
*   **Logic**:
    -   Reads `BaseConfig.TRAIN_DATA`, `BaseConfig.TEST_DATA`, and `BaseConfig.SEAPHAGE_PROB` (all three must be configured in `BaseConfig`).
    -   Extracts the `SEAPHAGES` samples with `label==1` and aligns them with the probability table using the `DNA_seqid` prefix to generate `confidence`.
    -   Converts `FS_period` sequences to one-hot encoding, standardizes them, and uses the high-confidence samples as a reference library. It then calculates the average KNN distance for each medium/low-confidence sample and filters them based on a quantile threshold.
    -   Merges the samples, annotates them with `confidence_level` (high/medium/low), and saves the result to a specified CSV.
*   **Input/Output**:
    -   **Input**: Training/testing CSVs and the `seaphage_prob` probability table.
    -   **Output**: The filtered `seaphage_selected.csv` (Note: the output path is currently hardcoded and should be managed by `BaseConfig`).
*   **Dependencies**:
    -   `pandas`, `numpy`, `scikit-learn` (`NearestNeighbors`, `StandardScaler`).
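
The distance-based filter described above can be illustrated with a small dependency-free sketch. The script itself relies on scikit-learn (`NearestNeighbors`, `StandardScaler`); this version deliberately skips standardization and uses plain Python so the idea is easy to follow. The function names, the value of `k`, and the quantile are all hypothetical, and sequences are assumed to have equal length.

```python
import math
from typing import List, Sequence

BASES = "ACGT"


def one_hot(seq: str) -> List[float]:
    """Flatten a sequence into a 4-per-position one-hot vector (unknown bases -> zeros)."""
    vec: List[float] = []
    for base in seq.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def mean_knn_distance(query: Sequence[float], refs: Sequence[Sequence[float]], k: int) -> float:
    """Average Euclidean distance from `query` to its k nearest reference vectors."""
    nearest = sorted(math.dist(query, ref) for ref in refs)[:k]
    return sum(nearest) / len(nearest)


def select_near_reference(candidates: List[str], references: List[str],
                          k: int = 3, quantile: float = 0.5) -> List[str]:
    """Keep candidates whose mean k-NN distance is at or below the chosen quantile."""
    if not candidates:
        return []
    ref_vecs = [one_hot(s) for s in references]
    dists = [mean_knn_distance(one_hot(s), ref_vecs, k) for s in candidates]
    # Lower nearest-rank quantile over the candidate distances.
    ranked = sorted(dists)
    cutoff = ranked[int(quantile * (len(ranked) - 1))]
    return [s for s, d in zip(candidates, dists) if d <= cutoff]
```

Here `references` plays the role of the high-confidence library and `candidates` the medium/low-confidence pool; a lower quantile keeps only the candidates that sit closest to the reference sequences.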

---

## `train_models/`

### `train_models/hist_gb.py`

*   **Purpose**:
    -   Trains and evaluates a `HistGradientBoostingClassifier` for frameshift site classification. It supports constructing explicit features from `full_seq` and adding MFE features.
*   **Logic and Features**:
    -   A central crop of `GBConfig.SEQUENCE_LENGTH=33` is used to maintain the reading frame. With crop length L, the constructed features are:
        -   Mononucleotide one-hot features (4×L).
        -   Dinucleotide features (16×(L-1)).
        -   Trinucleotide (codon) features (64×(L-2)).
        -   Structural energy features: `mfe_40bp`, `mfe_120bp`.
*   **Key Functions**:
    -   `sequence_to_features()`, `prepare_data()`: Generate the feature matrix and weights from `full_seq` and MFE columns.
    -   `train_hist_model()`: Handles training, validation split, and evaluation (on the test set and the external `Xu/Atkins` sets).
    -   `analyze_feature_importance()`: Exports built-in feature importance to a CSV file.
*   **Input/Output**:
    -   **Input**: Merged data from `BaseConfig.DATA_DIR` (`merged_train_data.csv`, `merged_test_data.csv`, etc.).
    -   **Output**: Model object, evaluation metrics, and an importance CSV (saved to `BaseConfig.GB_DIR`).
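
The feature layout above (4×L + 16×(L-1) + 64×(L-2) + 2) can be made concrete with a short sketch. It is an assumption-laden illustration rather than the repository's `sequence_to_features()`: the function names are hypothetical, the crop is assumed to be taken symmetrically around the middle of `full_seq`, and k-mers containing characters other than A/C/G/T are assumed to encode as all zeros.

```python
from itertools import product
from typing import List

BASES = "ACGT"
# Index tables for every 1-, 2- and 3-mer, e.g. KMER_INDEX[2]["AC"] == 1.
KMER_INDEX = {
    k: {"".join(p): i for i, p in enumerate(product(BASES, repeat=k))}
    for k in (1, 2, 3)
}


def center_crop(full_seq: str, length: int = 33) -> str:
    """Take `length` bases from the middle of the sequence (assumed crop rule)."""
    start = (len(full_seq) - length) // 2
    return full_seq[start:start + length]


def kmer_one_hot(seq: str, k: int) -> List[float]:
    """One-hot encode every overlapping k-mer: 4**k slots per position."""
    index = KMER_INDEX[k]
    features: List[float] = []
    for pos in range(len(seq) - k + 1):
        slot = [0.0] * (4 ** k)
        kmer_id = index.get(seq[pos:pos + k])  # None for k-mers containing N etc.
        if kmer_id is not None:
            slot[kmer_id] = 1.0
        features.extend(slot)
    return features


def sequence_features(full_seq: str, mfe_40bp: float, mfe_120bp: float,
                      length: int = 33) -> List[float]:
    """Concatenate mono-, di- and trinucleotide one-hot blocks plus the two MFE values."""
    crop = center_crop(full_seq.upper(), length)
    features: List[float] = []
    for k in (1, 2, 3):
        features.extend(kmer_one_hot(crop, k))
    return features + [mfe_40bp, mfe_120bp]
```

For L = 33 this yields 132 + 512 + 1984 + 2 = 2630 values per sample. With a 399-nt input, the symmetric crop starts at offset 183, a multiple of 3, which is one way a central crop can stay in frame.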

### `train_models/bilstm_cnn.py`

*   **Purpose**:
    -   An end-to-end sequence classification model with a hybrid architecture. The model processes sequences through the following layers: `Input` → `Embedding` → `BiLSTM` → `Parallel CNN Branches` → `Concatenation` → `Dense Layers` → `Sigmoid Output`.
    -   Supports self-training: in each iteration, pseudo-labeled samples are selected from a pool of low-confidence samples and added to the training set.
*   **Logic**:
    -   **Sequence Encoding**: `encode_sequence()` maps the bases 'ATCG' to {1,2,3,4} and pads/trims to `Config.Sequence_len=399`.
    -   **Training Monitoring**: `MetricsCallback` calculates test set metrics at each epoch to track the best-performing model.
    -   **Self-training Loop**: Calls `utils.function.select_low_confidence_samples_cnn()` to select samples based on model probability and a given `final_prob`.
*   **Key Functions**:
    -   `create_bilstm_cnn_model()`, `prepare_data()`, `train_bilstm_cnn_model()`.
    -   `main()`: Loads data, trains the model, and saves the best and final models along with training information using `save_training_info()`.
*   **Input/Output**:
    -   **Input**: Train/test sets returned by `load_data()`, and optional external validation sets (Xu/Atkins).
    -   **Output**: Saves the `*.h5` model, `*_training_info.json`, and `*_weights.pkl` to `BaseConfig.BILSTM_MODEL_DIR`.
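
As a rough illustration of the sequence encoding step, the sketch below maps bases to integer IDs and pads or trims to a fixed length. It is not the repository's `encode_sequence()`: the function name is hypothetical, and three details are assumptions, namely that 0 is the padding ID, that unknown characters also map to 0, and that padding and trimming happen on the right.

```python
from typing import List

# Mapping in the order given above ('ATCG' -> 1..4); 0 is assumed to be padding.
BASE_TO_ID = {"A": 1, "T": 2, "C": 3, "G": 4}
PAD_ID = 0


def encode_sequence_sketch(seq: str, max_len: int = 399) -> List[int]:
    """Integer-encode a sequence, trimming or right-padding it to `max_len`."""
    ids = [BASE_TO_ID.get(base, PAD_ID) for base in seq.upper()[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))
```

Reserving 0 for padding leaves the IDs 1 to 4 free for the embedding layer's vocabulary, which is a common convention when an `Embedding` layer sits at the front of the network.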

---

## `model_feature/`

### `model_feature/feature_analysis.py`

*   **Purpose**:
    -   Provides a unified interface for feature importance analysis:
        -   **GB Model**: Permutation Importance (`sklearn.inspection.permutation_importance`) and SHAP.
        -   **BiLSTM-CNN Model**: Integrated Gradients and Saliency Maps (gradient-based).
*   **Logic**:
    -   Reads the trained models and validation sets, encodes the data according to each model's pipeline, calculates importance, and saves the results to separate files plus a summary JSON.
*   **Key Classes/Methods**:
    -   `FeatureImportanceAnalyzer`: A class that encapsulates model/data loading, feature preparation, and the various importance methods.
    -   `run_all_analyses()`: A single call that runs all analyses and saves results to `output_dir/{gb_model,bilstm_model,combined_analysis}`.
*   **Note**:
    -   Import paths are currently written as `from models.hist_gb ...` and `from models.bilstm_cnn ...`, but the actual files are in `train_models/`. This inconsistency must be fixed before running, either by correcting the import paths or by creating a `models` package.
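
To show what permutation importance measures, here is a small dependency-free sketch of the idea. The analysis script itself uses `sklearn.inspection.permutation_importance`; the functions below are hypothetical stand-ins that shuffle one feature column at a time and record the average drop in accuracy for an arbitrary `predict` callable.

```python
import random
from typing import Callable, List, Sequence


def accuracy(predict: Callable[[Sequence[float]], int],
             X: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance_sketch(predict: Callable[[Sequence[float]], int],
                                  X: Sequence[Sequence[float]], y: Sequence[int],
                                  n_repeats: int = 5, seed: int = 0) -> List[float]:
    """Mean accuracy drop per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances: List[float] = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Copy the data and overwrite only the column under test.
            X_perm = [list(row) for row in X]
            for row, value in zip(X_perm, shuffled):
                row[col] = value
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the model never uses gets an importance of exactly zero, because shuffling it cannot change any prediction; features the model depends on show a positive drop.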

---

## `utils/`

### `utils/config.py`

*   **Purpose**:
    -   Centralizes the management of paths for data, models, and results. Provides `create_directories()` to ensure that directories exist.
*   **Note**:
    -   Currently contains placeholder paths (e.g., `/path/to/...`). These must be modified according to the environment before execution, including `DATA_DIR`, `TRAIN_DATA`, `RESULT_DIR`, etc.
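
One possible way to avoid hand-editing placeholder paths is to derive every path from a single root. The sketch below is an alternative layout, not the contents of `config.py`: the class `PathConfig`, the environment variable `FSCANPY_ROOT`, and the sub-directory names are all hypothetical, and the training file name is borrowed from the `hist_gb.py` description as an assumption.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PathConfig:
    """Derive all project paths from one root directory (hypothetical layout)."""

    root: Path

    @property
    def data_dir(self) -> Path:
        return self.root / "data"

    @property
    def result_dir(self) -> Path:
        return self.root / "results"

    @property
    def gb_dir(self) -> Path:
        return self.result_dir / "gb"

    @property
    def bilstm_model_dir(self) -> Path:
        return self.result_dir / "bilstm"

    @property
    def train_data(self) -> Path:
        # Assumed file name; adjust to the actual training CSV.
        return self.data_dir / "merged_train_data.csv"

    def create_directories(self) -> None:
        """Create every directory if it does not exist yet."""
        for directory in (self.data_dir, self.result_dir, self.gb_dir, self.bilstm_model_dir):
            directory.mkdir(parents=True, exist_ok=True)


def config_from_env(var: str = "FSCANPY_ROOT") -> PathConfig:
    """Build the config from an environment variable, defaulting to the current directory."""
    return PathConfig(Path(os.environ.get(var, ".")))
```

With this shape, switching machines only requires setting one environment variable before calling `config_from_env().create_directories()`.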

### `utils/function.py`

*   **Purpose**:
    -   **Common Utilities**:
        -   **Self-training Sample Selection** (separate versions for the CNN and GB models): Selects pseudo-labeled samples based on model probability, entropy, and an external `final_prob` confidence threshold.
        -   **Save Training Results and Models**: `save_training_info()` saves the `.h5` model, a training info JSON, and a weights pkl in one call.
        -   **Data Loading**: `load_data()` merges the data, validates the required columns (`full_seq`, `label`, `source`), and downsamples `EUPLOTES` negative samples as needed.
        -   **Evaluation**: `evaluate_model_gb()` and `evaluate_model_cnn()` calculate common metrics and log loss.
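
The three selection criteria named above (model probability, entropy, and the external `final_prob`) could be combined along the following lines. This is a hedged sketch, not the logic of `select_low_confidence_samples_cnn()`: the function names, every threshold value, and the rule that all three checks must pass are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def binary_entropy(p: float) -> float:
    """Shannon entropy (bits) of a Bernoulli(p) prediction; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_pseudo_labels(model_probs: Sequence[float], final_probs: Sequence[float],
                         prob_margin: float = 0.9, max_entropy: float = 0.5,
                         min_final_prob: float = 0.5) -> List[Tuple[int, int]]:
    """Return (index, pseudo_label) pairs for pool samples that pass all three checks."""
    selected: List[Tuple[int, int]] = []
    for i, (p, external) in enumerate(zip(model_probs, final_probs)):
        # Check 1: the model is confident in either direction.
        confident = p >= prob_margin or p <= 1.0 - prob_margin
        # Check 2: low predictive entropy. Check 3: external confidence is high enough.
        if confident and binary_entropy(p) <= max_entropy and external >= min_final_prob:
            selected.append((i, int(p >= 0.5)))
    return selected
```

The returned indices identify pool samples to move into the training set for the next self-training iteration, each with the label implied by the model's own prediction.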

---

## Interactions & Flow

1.  Use `data_pre/mfe.py` to calculate and write MFE columns to the data CSVs.
2.  Use `data_pre/seaphage_knn.py` to filter and supplement training samples based on confidence and KNN.
3.  **Training**:
    -   **GB (`train_models/hist_gb.py`)**: Construct explicit features from `full_seq` + `mfe_*` for training and evaluation.
    -   **BiLSTM-CNN (`train_models/bilstm_cnn.py`)**: End-to-end training on encoded sequences using a hybrid BiLSTM-CNN architecture, with support for iterative self-training.
4.  **Analysis**: `model_feature/feature_analysis.py` outputs feature/positional importance and a summary from various methods.

---