Course Projects


MPMalGen: Multi-Platform LLM-Powered Malware Variant Generation

A framework extending LLM-based malware variant generation to multi-platform environments (Windows & Android), featuring platform-specific transformation strategies and automated validation pipelines.

Highlights

  • Extended the LLMalMorph framework with multi-platform support for Windows (C/C++) and Android (Java/Kotlin/Smali) malware sources.
  • Introduced Android-specific prompt-engineering strategies that preserve component lifecycles, manifest registrations, and permission models (see the prompt-assembly sketch after this list).
  • Integrated DeepSeek-Coder-v2-16B-Instruct for code transformation; implemented automated syntactic validation and error-recovery pipelines (sketched after the architecture diagram below).
  • Evaluated on 4 Windows malware families, 1 Android project, and 1 APK; achieved AV evasion rates comparable to SOTA.
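
A minimal sketch of how such a platform-aware prompt could be assembled. The constraint text, function, and variable names here are illustrative, not the framework's actual prompts:

    # Illustrative sketch of platform-aware prompt assembly (hypothetical names).
    ANDROID_CONSTRAINTS = (
        "Preserve Activity/Service lifecycle callbacks (onCreate, onDestroy, ...). "
        "Do not rename classes registered in AndroidManifest.xml. "
        "Do not add or remove declared permissions."
    )
    WINDOWS_CONSTRAINTS = (
        "Preserve exported symbols and calling conventions. "
        "Keep Win32 API call semantics unchanged."
    )

    def build_prompt(function_source: str, platform: str, strategy: str) -> str:
        """Wrap one extracted function in a transformation prompt for the LLM."""
        constraints = ANDROID_CONSTRAINTS if platform == "android" else WINDOWS_CONSTRAINTS
        return (
            f"Apply the '{strategy}' transformation to the function below.\n"
            f"Constraints: {constraints}\n\n"
            f"{function_source}\n\n"
            "Return only the transformed function."
        )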

System Architecture

flowchart TD
    A[(Malware Source)] --> B[Multi-Platform Ingestion]
    B --> C[Platform-Aware AST Parser]
    C --> D[Function Extraction]
    D --> E[Prompt Engineering]
    E --> F[DeepSeek-Coder-v2 LLM]
    F --> G[Syntactic Validation]
    G --> H{Compile OK?}
    H -->|No| F
    H -->|Yes| I[Recompiler]
    I --> J[(AV Evaluation)]
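
A minimal sketch of the validate-and-retry loop shown above, assuming a hypothetical llm_transform callable and using gcc -fsyntax-only as a stand-in for the platform compiler:

    # Illustrative compile-check / error-recovery loop (names are hypothetical).
    import subprocess

    def validate_and_recover(source: str, llm_transform, max_retries: int = 3):
        """Compile the variant; on failure, feed diagnostics back to the LLM."""
        candidate = llm_transform(source, feedback=None)
        for _ in range(max_retries):
            with open("variant.c", "w") as f:
                f.write(candidate)
            result = subprocess.run(
                ["gcc", "-fsyntax-only", "variant.c"],  # syntactic validation only
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                return candidate  # compiles: hand off to the recompiler stage
            # Error recovery: ask the LLM to repair using the compiler diagnostics.
            candidate = llm_transform(candidate, feedback=result.stderr)
        return None  # still failing after all retries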

Report: Download PDF
Presentation: Watch Video


Abstractive Title Generation

Sequence-to-sequence models for generating concise paper titles from scientific abstracts, comparing model capacity and instruction tuning.

Highlights

  • Fine-tuned T5-small, T5-base, and Flan-T5-base; decoded with beam search (num_beams=4), a no-repeat-bigram constraint, and early stopping (see the decoding sketch after this list).
  • T5-base improved BLEU by ~19% and ROUGE-L by ~8% over T5-small; Flan-T5 achieved the highest test ROUGE-L (0.439).
  • Conducted hyperparameter sweeps on epochs and learning rate; identified 5 epochs and lr = 5×10⁻⁴ as optimal for the mid-sized corpus.
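
A minimal sketch of the decoding setup with Hugging Face Transformers. The "summarize: " prefix and length limits are assumptions, not necessarily the report's exact configuration:

    # Title generation with beam search, no-repeat bigrams, and early stopping.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

    abstract = "We propose a sequence-to-sequence approach to ..."
    inputs = tokenizer("summarize: " + abstract, return_tensors="pt",
                       truncation=True, max_length=512)

    ids = model.generate(
        **inputs,
        num_beams=4,             # beam search with 4 beams
        no_repeat_ngram_size=2,  # forbid repeated bigrams
        early_stopping=True,
        max_new_tokens=32,       # titles are short
    )
    print(tokenizer.decode(ids[0], skip_special_tokens=True))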

System Architecture

flowchart TD
    A[(Abstract Text)] --> B[Prompt Template]
    B --> C[T5 / Flan-T5 Encoder]
    C --> D[Encoder Hidden States]
    D --> E[T5 Decoder + Beam Search]
    E --> F[Generated Title Tokens]
    F --> G[Detokenizer]
    G --> H[(Predicted Title)]
    H --> I[BLEU / ROUGE Evaluation]
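
A minimal sketch of the evaluation stage in the diagram, using the Hugging Face evaluate package; the example strings are placeholders:

    # Score predicted titles against references with BLEU and ROUGE-L.
    import evaluate

    predictions = ["abstractive title generation with t5"]
    references = ["generating titles from abstracts with t5"]

    rouge = evaluate.load("rouge")
    bleu = evaluate.load("bleu")

    print(rouge.compute(predictions=predictions, references=references)["rougeL"])
    print(bleu.compute(predictions=predictions,
                       references=[[r] for r in references])["bleu"])
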

Report: Download PDF


Biomedical Document Classification

Binary classification of scientific abstracts using transformer models, addressing severe class imbalance (~10:1) and train-test distribution mismatch.

Highlights

  • Fine-tuned BERT, DeBERTa-v3, and PubMedBERT with focal loss and weighted sampling to handle the class imbalance (focal loss sketched after this list).
  • Implemented curriculum learning by ranking samples by prediction uncertainty and training on progressively harder subsets.
  • Ensembled DeBERTa and PubMedBERT via probability averaging; reached F1 ≈ 0.88 on the public and F1 ≈ 0.87 on the private leaderboard of the corresponding Kaggle competition (the ensemble-and-threshold stage is sketched after the architecture diagram below).
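
A minimal sketch of focal loss for imbalanced binary classification; the gamma and alpha values are illustrative, and the report's settings may differ:

    # Focal loss: down-weights easy examples so rare-class errors dominate.
    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0, alpha=0.9):
        """Binary focal loss; alpha up-weights the rare positive class."""
        ce = F.binary_cross_entropy_with_logits(
            logits, targets.float(), reduction="none")
        p_t = torch.exp(-ce)  # model's probability for the true class
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** gamma * ce).mean()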

System Architecture

flowchart TD
    A[(Title and Abstract)] --> B[Prompt Formatter]
    B --> C[Tokenizer]
    C --> D[BERT Baseline]
    C --> E[DeBERTa-v3]
    C --> F[PubMedBERT]
    D --> G[Logits]
    E --> H[Logits]
    F --> I[Logits]
    G --> J[Ensemble Averaging]
    H --> J
    I --> J
    J --> K[Threshold Tuning]
    K --> L[(Final Prediction)]
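
A minimal sketch of the ensemble-averaging and threshold-tuning stages from the diagram, run on synthetic stand-in probabilities; the array names and threshold grid are illustrative:

    # Average per-model probabilities, then sweep the decision threshold for F1.
    import numpy as np
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    y_val = rng.integers(0, 2, size=200)  # stand-in validation labels
    p_deberta = np.clip(y_val + rng.normal(0, 0.4, 200), 0, 1)
    p_pubmedbert = np.clip(y_val + rng.normal(0, 0.4, 200), 0, 1)

    p_ens = (p_deberta + p_pubmedbert) / 2  # probability averaging

    thresholds = np.linspace(0.1, 0.9, 81)
    best = max(thresholds,
               key=lambda t: f1_score(y_val, (p_ens >= t).astype(int)))
    print(f"best threshold = {best:.2f}, "
          f"F1 = {f1_score(y_val, (p_ens >= best).astype(int)):.3f}")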

Report: Download PDF


Phrase Mining & Word Embeddings

An end-to-end pipeline for extracting and evaluating multi-word phrases from a large biomedical corpus, then retraining Word2Vec embeddings on the phrase-tagged text.

Highlights

  • Compared three phrase-mining strategies: naive greedy segmentation, a statistical mixed-method approach (PMI + TF-IDF + spaCy noun chunks), and a BioBERT-based semantic-clustering method (the PMI scorer is sketched after this list).
  • Trained Word2Vec (skip-gram) before and after phrase tagging; observed tighter semantic clusters and higher cosine similarities for domain-specific terms after tagging.
  • Validated extracted phrases against a curated phrase dictionary; the mixed-method approach achieved the best balance of coverage and precision.
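
A minimal sketch of the PMI component of the mixed-method scorer; the TF-IDF and spaCy noun-chunk filters are omitted, and min_count is an illustrative cutoff:

    # Score adjacent word pairs by pointwise mutual information (PMI).
    from collections import Counter
    from math import log2

    def pmi_bigrams(tokens, min_count=5):
        """Rank bigram phrase candidates by PMI over a token stream."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n = len(tokens)
        scores = {}
        for (w1, w2), count in bigrams.items():
            if count < min_count:
                continue
            p_xy = count / (n - 1)
            p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
            scores[(w1, w2)] = log2(p_xy / (p_x * p_y))
        return sorted(scores.items(), key=lambda kv: -kv[1])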

System Architecture

flowchart TD
    A[(Raw Corpus)] --> B[Tokenizer / Cleaner]
    B --> C[N-gram Generator]
    C --> D[Naive Greedy]
    C --> E[PMI + TF-IDF + NP]
    C --> F[BioBERT Embeddings]
    D --> G[Phrase Candidates]
    E --> H[Phrase Candidates]
    F --> I[Semantic Clusters]
    G --> J[Merge and Rank]
    H --> J
    I --> J
    J --> K[Phrase-Tagged Corpus]
    K --> L[Word2Vec Retraining]
    L --> M[(Similarity Evaluation)]
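
A minimal sketch of the Word2Vec retraining step from the diagram, using gensim; the tiny phrase-tagged corpus is only for illustration:

    # Train skip-gram Word2Vec on phrase-tagged text and inspect neighbors.
    from gensim.models import Word2Vec

    tagged_corpus = [
        ["the", "gene_expression", "profile", "was", "analyzed"],
        ["gene_expression", "levels", "correlate", "with", "disease", "progression"],
    ] * 50  # repeat so the toy vocabulary clears min_count

    model = Word2Vec(tagged_corpus, vector_size=100, window=5,
                     min_count=5, sg=1)  # sg=1 selects skip-gram
    print(model.wv.most_similar("gene_expression", topn=3))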

Report: Download PDF
Presentation: Watch Video