# FScanpy Commit Code Architecture
## Code Manifest
- `data_pre/`
  - `mfe.py`
  - `seaphage_knn.py`
- `model_feature/`
  - `feature_analysis.py`
- `train_models/`
  - `bilstm_cnn.py`
  - `hist_gb.py`
- `utils/`
  - `config.py`
  - `function.py`
## Overview & Pipeline
This repository is organized around the main pipeline: "Read and preprocess sequence data → Construct features and train models → Analyze importance and save results."
1. **Data Preprocessing (`data_pre/`)**
- Calculate and write back structural energy features (MFE), and filter samples based on external confidence scores and KNN distance.
2. **Model Training (`train_models/`)**
- The traditional Gradient Boosting model (HistGradientBoosting) uses explicit features (mono/di/tri-nucleotides + MFE).
- The deep learning model (BiLSTM-CNN) learns representations end-to-end from encoded sequences and supports iterative self-training.
3. **Feature Importance Analysis (`model_feature/`)**
- For the GB model, perform Permutation Importance and SHAP analysis.
- For the BiLSTM-CNN model, perform Integrated Gradients and Saliency Map analysis.
4. **Common Configuration and Utilities (`utils/`)**
- Contains general-purpose functions for path/directory configuration, data loading, evaluation, and results saving.
---
## `data_pre/`
### `data_pre/mfe.py`
* **Purpose**:
- Calls ViennaRNA (`import RNA`) to calculate the Minimum Free Energy (MFE) for given sequence windows and writes the results back to a CSV file.
* **Logic**:
- For each `full_seq`, it crops subsequences based on configuration (default start: 198, lengths: 40 and 120).
- Uses `RNA.fold_compound(sub_seq).mfe()` to calculate MFE and populates the specified columns (`mfe_40bp`, `mfe_120bp`).
* **Key Functions**:
- `predict_rna_structure_multiple(csv_file, predictions)`: Writes back columns for multiple window configurations.
- `calculate_mfe_features(data_file, output_file=None)`: The standard entry point that assembles predictions and calls the above function.
* **Input/Output**:
- **Input**: A CSV file containing `full_seq` (e.g., `BaseConfig.VALIDATION_DATA`).
- **Output**: Appends two columns, `mfe_40bp` and `mfe_120bp`, to the original CSV and saves it.
* **Dependencies**:
- ViennaRNA Python interface (`RNA`).
- `pandas`.
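
A minimal sketch of this write-back step, assuming a CSV with a `full_seq` column; the window start (198) and lengths (40/120) follow the defaults described above, and the file path is a placeholder:

```python
import pandas as pd
import RNA  # ViennaRNA Python bindings

START = 198                                   # default window start
WINDOWS = {"mfe_40bp": 40, "mfe_120bp": 120}  # column name -> window length

def window_mfe(full_seq: str, start: int, length: int) -> float:
    """Fold one subsequence and return its minimum free energy."""
    sub_seq = full_seq[start:start + length]
    _structure, mfe = RNA.fold_compound(sub_seq).mfe()
    return mfe

df = pd.read_csv("validation_data.csv")  # e.g. BaseConfig.VALIDATION_DATA
for col, length in WINDOWS.items():
    df[col] = df["full_seq"].apply(lambda s: window_mfe(s, START, length))
df.to_csv("validation_data.csv", index=False)  # write both MFE columns back
```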
### `data_pre/seaphage_knn.py`
* **Purpose**:
- Selects medium/low-confidence neighbor samples based on externally provided confidence rankings for the `SEAPHAGES` subset (`final_rank` → `confidence`) and KNN distances between one-hot encoded sequences, then merges them with high-confidence samples to form an augmented training set.
* **Logic**:
- Reads `BaseConfig.TRAIN_DATA`, `BaseConfig.TEST_DATA`, and `BaseConfig.SEAPHAGE_PROB` (requires configuration in `BaseConfig`).
- Extracts samples from `SEAPHAGES` with `label==1`, aligns them with the probability table using the `DNA_seqid` prefix to generate `confidence`.
- Converts `FS_period` sequences to one-hot encoding, standardizes them, and uses high-confidence samples as a reference library. It then calculates the average KNN distance for medium/low-confidence samples and filters them based on a quantile threshold.
- Merges and annotates with `confidence_level` (high/medium/low), then saves to a specified CSV.
* **Input/Output**:
- **Input**: Training/testing CSVs and the `seaphage_prob` probability table.
- **Output**: The filtered `seaphage_selected.csv` (Note: the output path is currently hardcoded and should be managed by `BaseConfig`).
* **Dependencies**:
- `pandas`, `numpy`, `scikit-learn` (`NearestNeighbors`, `StandardScaler`).
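
A hedged sketch of the distance filter, assuming fixed-length `FS_period` sequences; `k` and the quantile cutoff are illustrative values, not the script's exact settings:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def one_hot(seq: str) -> np.ndarray:
    """Flatten a fixed-length DNA sequence into a one-hot vector."""
    idx = {"A": 0, "T": 1, "C": 2, "G": 3}
    arr = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        if base in idx:
            arr[i, idx[base]] = 1.0
    return arr.ravel()

def knn_filter(high_seqs, cand_seqs, k=5, quantile=0.5):
    """Keep medium/low-confidence candidates whose average distance to the
    high-confidence reference library falls below a quantile threshold."""
    scaler = StandardScaler()
    X_high = scaler.fit_transform(np.array([one_hot(s) for s in high_seqs]))
    X_cand = scaler.transform(np.array([one_hot(s) for s in cand_seqs]))
    nn = NearestNeighbors(n_neighbors=k).fit(X_high)
    dists, _ = nn.kneighbors(X_cand)
    avg_dist = dists.mean(axis=1)  # average KNN distance per candidate
    return avg_dist <= np.quantile(avg_dist, quantile)  # boolean keep-mask
```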
---
## `train_models/`
### `train_models/hist_gb.py`
* **Purpose**:
- Uses `HistGradientBoostingClassifier` for frameshift site classification, training, and evaluation. It supports constructing explicit features from `full_seq` and adding MFE features.
* **Logic and Features**:
- A central crop of length `GBConfig.SEQUENCE_LENGTH=33` preserves the reading frame; from it, the following features are constructed:
- Mononucleotide one-hot features (4×L).
- Dinucleotide features (16×(L-1)).
- Trinucleotide (codon) features (64×(L-2)).
- Structural energy features: `mfe_40bp`, `mfe_120bp`.
* **Key Functions**:
- `sequence_to_features()`, `prepare_data()`: Generate the feature matrix and weights from `full_seq` and MFE columns.
- `train_hist_model()`: Handles training, validation split, and evaluation (on test set and external `Xu/Atkins` sets).
- `analyze_feature_importance()`: Exports built-in feature importance to a CSV file.
* **Input/Output**:
- **Input**: Merged data from `BaseConfig.DATA_DIR` (`merged_train_data.csv`, `merged_test_data.csv`, etc.).
- **Output**: Model object, evaluation metrics, and an importance CSV (saved to `BaseConfig.GB_DIR`).
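
A sketch of the explicit feature construction, assuming the 33 nt central crop; the function name mirrors `sequence_to_features()`, but its exact signature is an assumption:

```python
from itertools import product
import numpy as np

BASES = "ACGT"
DI  = ["".join(p) for p in product(BASES, repeat=2)]  # 16 dinucleotides
TRI = ["".join(p) for p in product(BASES, repeat=3)]  # 64 trinucleotides (codons)

def sequence_to_features(seq: str, mfe_40: float, mfe_120: float) -> np.ndarray:
    """Build the 4*L + 16*(L-1) + 64*(L-2) + 2 explicit feature vector."""
    L = len(seq)  # expected GBConfig.SEQUENCE_LENGTH = 33
    mono = np.zeros((4, L))
    di   = np.zeros((16, L - 1))
    tri  = np.zeros((64, L - 2))
    for i, base in enumerate(seq):
        if base in BASES:
            mono[BASES.index(base), i] = 1.0
    for i in range(L - 1):
        if seq[i:i + 2] in DI:
            di[DI.index(seq[i:i + 2]), i] = 1.0
    for i in range(L - 2):
        if seq[i:i + 3] in TRI:
            tri[TRI.index(seq[i:i + 3]), i] = 1.0
    # Append the two structural energy features at the end.
    return np.concatenate([mono.ravel(), di.ravel(), tri.ravel(),
                           [mfe_40, mfe_120]])
```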
### `train_models/bilstm_cnn.py`
* **Purpose**:
- An end-to-end sequence classification model with a hybrid architecture. The model processes sequences through the following layers: `Input` → `Embedding` → `BiLSTM` → `Parallel CNN Branches` → `Concatenation` → `Dense Layers` → `Sigmoid Output`.
- Supports self-training: iteratively selects pseudo-labeled samples from a pool of low-confidence samples to add to the training set.
* **Logic**:
- **Sequence Encoding**: `encode_sequence()` converts 'ATCG' to {1,2,3,4} and pads/trims to `Config.Sequence_len=399`.
- **Training Monitoring**: `MetricsCallback` calculates test set metrics at each epoch to track the best-performing model.
- **Self-training Loop**: Calls `utils.function.select_low_confidence_samples_cnn()` to select samples based on model probability and a given `final_prob`.
* **Key Functions**:
- `create_bilstm_cnn_model()`, `prepare_data()`, `train_bilstm_cnn_model()`.
- `main()`: Loads data, trains the model, and saves the best and final models along with training information using `save_training_info()`.
* **Input/Output**:
- **Input**: Train/test sets returned by `load_data()`, and optional external validation sets (Xu/Atkins).
- **Output**: Saves `*.h5` model, `*_training_info.json`, and `*_weights.pkl` to `BaseConfig.BILSTM_MODEL_DIR`.
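
A minimal Keras sketch of the encoding and the hybrid architecture; embedding size, LSTM units, and kernel widths are illustrative assumptions (only `Sequence_len=399` and the layer order are documented above):

```python
from tensorflow.keras import layers, Model

def encode_sequence(seq: str, seq_len: int = 399):
    """Map ATCG -> {1,2,3,4}; pad/trim with 0 to a fixed length."""
    mapping = {"A": 1, "T": 2, "C": 3, "G": 4}
    ids = [mapping.get(b, 0) for b in seq[:seq_len]]
    return ids + [0] * (seq_len - len(ids))

def create_bilstm_cnn_model(seq_len: int = 399) -> Model:
    inp = layers.Input(shape=(seq_len,))
    x = layers.Embedding(input_dim=5, output_dim=64)(inp)  # 0 reserved for padding
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    branches = []
    for k in (3, 5, 7):  # parallel CNN branches with different kernel widths
        b = layers.Conv1D(64, kernel_size=k, padding="same", activation="relu")(x)
        branches.append(layers.GlobalMaxPooling1D()(b))
    x = layers.Concatenate()(branches)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # binary frameshift probability
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```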
---
## `model_feature/`
### `model_feature/feature_analysis.py`
* **Purpose**:
- Provides a unified interface for feature importance analysis:
- **GB Model**: Permutation Importance (`sklearn.inspection.permutation_importance`) and SHAP.
- **BiLSTM-CNN Model**: Integrated Gradients and Saliency Maps (Gradient-based).
* **Logic**:
- Reads trained models and validation sets, encodes data according to each model's pipeline, calculates importance, and saves results to separate files and a summary JSON.
* **Key Classes/Methods**:
- `FeatureImportanceAnalyzer`: A class that encapsulates model/data loading, feature preparation, and various importance methods.
- `run_all_analyses()`: A single command to run all analyses and save results to `output_dir/{gb_model,bilstm_model,combined_analysis}`.
* **Note**:
- Import paths are currently written as `from models.hist_gb ...` and `from models.bilstm_cnn ...`, but the actual files are in `train_models/`. This inconsistency needs to be fixed before running, either by correcting the paths or creating a package with that name.
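
For the GB side, the permutation importance call is standard scikit-learn; `gb_model`, `X_val`, and `y_val` stand in for the loaded model and validation data:

```python
from sklearn.inspection import permutation_importance

# gb_model: trained HistGradientBoostingClassifier; X_val/y_val: validation set
result = permutation_importance(
    gb_model, X_val, y_val,
    n_repeats=10, random_state=0, scoring="roc_auc",
)
importances = result.importances_mean  # one mean importance per explicit feature
```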
---
## `utils/`
### `utils/config.py`
* **Purpose**:
- Centralizes the management of paths for data, models, and results. Provides `create_directories()` to ensure that directories exist.
* **Note**:
- Currently contains placeholder paths (e.g., `/path/to/...`). These must be modified according to the environment before execution, including `DATA_DIR`, `TRAIN_DATA`, `RESULT_DIR`, etc.
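
An illustrative shape for the configuration, using the attribute names referenced elsewhere in this document; every path is a placeholder:

```python
import os

class BaseConfig:
    # Placeholder paths -- adapt to the local environment before running.
    DATA_DIR = "/path/to/data"
    TRAIN_DATA = os.path.join(DATA_DIR, "merged_train_data.csv")
    TEST_DATA = os.path.join(DATA_DIR, "merged_test_data.csv")
    VALIDATION_DATA = os.path.join(DATA_DIR, "validation_data.csv")
    SEAPHAGE_PROB = os.path.join(DATA_DIR, "seaphage_prob.csv")
    RESULT_DIR = "/path/to/results"
    GB_DIR = os.path.join(RESULT_DIR, "gb_model")
    BILSTM_MODEL_DIR = os.path.join(RESULT_DIR, "bilstm_model")

    @classmethod
    def create_directories(cls):
        """Ensure every output directory exists before anything is written."""
        for d in (cls.DATA_DIR, cls.RESULT_DIR, cls.GB_DIR,
                  cls.BILSTM_MODEL_DIR):
            os.makedirs(d, exist_ok=True)
```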
### `utils/function.py`
* **Purpose**:
- **Common Utilities**:
- **Self-training Sample Selection** (separate variants for the CNN and GB models): Selects pseudo-labeled samples based on model probability, entropy, and an external `final_prob` confidence threshold (sketched after this list).
- **Save Training Results and Models**: `save_training_info()` saves the `.h5` model, a training-info JSON, and a weights `.pkl` in one call.
- **Data Loading**: `load_data()` merges and validates columns (`full_seq`, `label`, `source`) and downsamples `EUPLOTES` negative samples as needed.
- **Evaluation**: `evaluate_model_gb()` and `evaluate_model_cnn()` calculate common metrics and logloss.
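
A hedged sketch of the pseudo-label selection idea shared by the two variants; the cutoff values are illustrative, not the utilities' actual defaults:

```python
import numpy as np

def select_pseudo_labels(probs, final_prob,
                         prob_cut=0.9, ent_cut=0.3, conf_cut=0.5):
    """Return a keep-mask and pseudo-labels for confident, low-entropy samples
    that also pass the external `final_prob` confidence threshold."""
    probs = np.asarray(probs, dtype=float)
    entropy = -(probs * np.log(probs + 1e-9)
                + (1 - probs) * np.log(1 - probs + 1e-9))  # binary entropy
    confident = np.maximum(probs, 1 - probs) >= prob_cut   # sure either way
    mask = confident & (entropy <= ent_cut) & (np.asarray(final_prob) >= conf_cut)
    labels = (probs >= 0.5).astype(int)  # pseudo-labels from model predictions
    return mask, labels
```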
---
## Interactions & Flow
1. Use `data_pre/mfe.py` to calculate and write MFE columns to the data CSVs.
2. Use `data_pre/seaphage_knn.py` to filter and supplement training samples based on confidence and KNN.
3. **Training**:
- **GB (`train_models/hist_gb.py`)**: Construct explicit features from `full_seq` + `mfe_*` for training and evaluation.
- **BiLSTM-CNN (`train_models/bilstm_cnn.py`)**: End-to-end training on encoded sequences using a hybrid BiLSTM-CNN architecture, with support for iterative self-training.
4. **Analysis**: `model_feature/feature_analysis.py` outputs feature/positional importance and a summary from various methods.
---