Course Projects


MPMalGen: Multi-Platform LLM-Powered Malware Variant Generation

A framework extending LLM-based malware variant generation to multi-platform environments (Windows & Android), featuring platform-specific transformation strategies and automated validation pipelines.

Highlights

  • Extended the LLMalMorph framework with multi-platform support for Windows (C/C++) and Android (Java/Kotlin/Smali) malware sources.
  • Introduced Android-specific prompt-engineering strategies that preserve component lifecycles, manifest registrations, and permission models (see the prompt-assembly sketch after this list).
  • Integrated DeepSeek-Coder-v2-16B-Instruct for code transformation; implemented automated syntactic validation and error-recovery pipelines (sketched after the architecture diagram below).
  • Evaluated on 4 Windows malware families, 1 Android project, and 1 APK; achieved AV evasion rates comparable to SOTA.
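
A minimal sketch of how such a platform-aware prompt could be assembled. The constraint text, function, and variable names here are illustrative, not the framework's actual prompts:

    # Illustrative sketch of platform-aware prompt assembly (hypothetical names).
    ANDROID_CONSTRAINTS = (
        "Preserve Activity/Service lifecycle callbacks (onCreate, onDestroy, ...). "
        "Do not rename classes registered in AndroidManifest.xml. "
        "Do not add or remove declared permissions."
    )
    WINDOWS_CONSTRAINTS = (
        "Preserve exported symbols and calling conventions. "
        "Keep Win32 API call semantics unchanged."
    )

    def build_prompt(function_source: str, platform: str, strategy: str) -> str:
        """Wrap one extracted function in a transformation prompt for the LLM."""
        constraints = ANDROID_CONSTRAINTS if platform == "android" else WINDOWS_CONSTRAINTS
        return (
            f"Apply the '{strategy}' transformation to the function below.\n"
            f"Constraints: {constraints}\n\n"
            f"{function_source}\n\n"
            "Return only the transformed function."
        )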

System Architecture

flowchart TD
    A[(Malware Source)] --> B[Multi-Platform Ingestion]
    B --> C[Platform-Aware AST Parser]
    C --> D[Function Extraction]
    D --> E[Prompt Engineering]
    E --> F[DeepSeek-Coder-v2 LLM]
    F --> G[Syntactic Validation]
    G --> H{Compile OK?}
    H -->|No| F
    H -->|Yes| I[Recompiler]
    I --> J[(AV Evaluation)]
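
A minimal sketch of the validate-and-retry loop shown above, assuming a hypothetical llm_transform callable and using gcc -fsyntax-only as a stand-in for the platform compiler:

    # Illustrative compile-check / error-recovery loop (names are hypothetical).
    import subprocess

    def validate_and_recover(source: str, llm_transform, max_retries: int = 3):
        """Compile the variant; on failure, feed diagnostics back to the LLM."""
        candidate = llm_transform(source, feedback=None)
        for _ in range(max_retries):
            with open("variant.c", "w") as f:
                f.write(candidate)
            result = subprocess.run(
                ["gcc", "-fsyntax-only", "variant.c"],  # syntactic validation only
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                return candidate  # compiles: hand off to the recompiler stage
            # Error recovery: ask the LLM to repair using the compiler diagnostics.
            candidate = llm_transform(candidate, feedback=result.stderr)
        return None  # still failing after all retries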

Report: Download PDF
Presentation: Watch Video


Abstractive Title Generation

Sequence-to-sequence models for generating concise paper titles from scientific abstracts, comparing model capacity and instruction tuning.

Highlights

  • Fine-tuned T5-small, T5-base, and Flan-T5-base; decoded with beam search (num_beams=4), a no-repeat-bigram constraint, and early stopping (see the decoding sketch after this list).
  • T5-base improved BLEU by ~19% and ROUGE-L by ~8% over T5-small; Flan-T5 achieved the highest test ROUGE-L (0.439).
  • Conducted hyperparameter sweeps on epochs and learning rate; identified 5 epochs and lr = 5×10⁻⁴ as optimal for the mid-sized corpus.
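
A minimal sketch of the decoding setup with Hugging Face Transformers. The "summarize: " prefix and length limits are assumptions, not necessarily the report's exact configuration:

    # Title generation with beam search, no-repeat bigrams, and early stopping.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

    abstract = "We propose a sequence-to-sequence approach to ..."
    inputs = tokenizer("summarize: " + abstract, return_tensors="pt",
                       truncation=True, max_length=512)

    ids = model.generate(
        **inputs,
        num_beams=4,             # beam search with 4 beams
        no_repeat_ngram_size=2,  # forbid repeated bigrams
        early_stopping=True,
        max_new_tokens=32,       # titles are short
    )
    print(tokenizer.decode(ids[0], skip_special_tokens=True))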

System Architecture

flowchart TD
    A[(Abstract Text)] --> B[Prompt Template]
    B --> C[T5 / Flan-T5 Encoder]
    C --> D[Encoder Hidden States]
    D --> E[T5 Decoder + Beam Search]
    E --> F[Generated Title Tokens]
    F --> G[Detokenizer]
    G --> H[(Predicted Title)]
    H --> I[BLEU / ROUGE Evaluation]
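
A minimal sketch of the evaluation stage in the diagram, using the Hugging Face evaluate package; the example strings are placeholders:

    # Score predicted titles against references with BLEU and ROUGE-L.
    import evaluate

    predictions = ["abstractive title generation with t5"]
    references = ["generating titles from abstracts with t5"]

    rouge = evaluate.load("rouge")
    bleu = evaluate.load("bleu")

    print(rouge.compute(predictions=predictions, references=references)["rougeL"])
    print(bleu.compute(predictions=predictions,
                       references=[[r] for r in references])["bleu"])
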

Report: Download PDF


Biomedical Document Classification

Binary classification of scientific abstracts using transformer models, addressing severe class imbalance (~10:1) and train-test distribution mismatch.

Highlights

  • Fine-tuned BERT, DeBERTa-v3, and PubMedBERT with focal loss and weighted sampling to handle the class imbalance (focal loss sketched after this list).
  • Implemented curriculum learning by ranking samples by prediction uncertainty and training on progressively harder subsets.
  • Ensembled DeBERTa and PubMedBERT via probability averaging; reached F1 ≈ 0.88 on the public and F1 ≈ 0.87 on the private leaderboard of the corresponding Kaggle competition (the ensemble-and-threshold stage is sketched after the architecture diagram below).
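
A minimal sketch of focal loss for imbalanced binary classification; the gamma and alpha values are illustrative, and the report's settings may differ:

    # Focal loss: down-weights easy examples so rare-class errors dominate.
    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0, alpha=0.9):
        """Binary focal loss; alpha up-weights the rare positive class."""
        ce = F.binary_cross_entropy_with_logits(
            logits, targets.float(), reduction="none")
        p_t = torch.exp(-ce)  # model's probability for the true class
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** gamma * ce).mean()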

System Architecture

flowchart TD
    A[(Title and Abstract)] --> B[Prompt Formatter]
    B --> C[Tokenizer]
    C --> D[BERT Baseline]
    C --> E[DeBERTa-v3]
    C --> F[PubMedBERT]
    D --> G[Logits]
    E --> H[Logits]
    F --> I[Logits]
    G --> J[Ensemble Averaging]
    H --> J
    I --> J
    J --> K[Threshold Tuning]
    K --> L[(Final Prediction)]
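
A minimal sketch of the ensemble-averaging and threshold-tuning stages from the diagram, run on synthetic stand-in probabilities; the array names and threshold grid are illustrative:

    # Average per-model probabilities, then sweep the decision threshold for F1.
    import numpy as np
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    y_val = rng.integers(0, 2, size=200)  # stand-in validation labels
    p_deberta = np.clip(y_val + rng.normal(0, 0.4, 200), 0, 1)
    p_pubmedbert = np.clip(y_val + rng.normal(0, 0.4, 200), 0, 1)

    p_ens = (p_deberta + p_pubmedbert) / 2  # probability averaging

    thresholds = np.linspace(0.1, 0.9, 81)
    best = max(thresholds,
               key=lambda t: f1_score(y_val, (p_ens >= t).astype(int)))
    print(f"best threshold = {best:.2f}, "
          f"F1 = {f1_score(y_val, (p_ens >= best).astype(int)):.3f}")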

Report: Download PDF


Phrase Mining & Word Embeddings

An end-to-end pipeline for extracting and evaluating multi-word phrases from a large biomedical corpus, then retraining Word2Vec embeddings on the phrase-tagged text.

Highlights

  • Compared three phrase-mining strategies: naive greedy segmentation, a statistical mixed-method approach (PMI + TF-IDF + spaCy noun chunks), and a BioBERT-based semantic-clustering method (the PMI scorer is sketched after this list).
  • Trained Word2Vec (skip-gram) before and after phrase tagging; observed tighter semantic clusters and higher cosine similarities for domain-specific terms after tagging.
  • Validated extracted phrases against a curated phrase dictionary; the mixed-method approach achieved the best balance of coverage and precision.
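
A minimal sketch of the PMI component of the mixed-method scorer; the TF-IDF and spaCy noun-chunk filters are omitted, and min_count is an illustrative cutoff:

    # Score adjacent word pairs by pointwise mutual information (PMI).
    from collections import Counter
    from math import log2

    def pmi_bigrams(tokens, min_count=5):
        """Rank bigram phrase candidates by PMI over a token stream."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n = len(tokens)
        scores = {}
        for (w1, w2), count in bigrams.items():
            if count < min_count:
                continue
            p_xy = count / (n - 1)
            p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
            scores[(w1, w2)] = log2(p_xy / (p_x * p_y))
        return sorted(scores.items(), key=lambda kv: -kv[1])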

System Architecture

flowchart TD
    A[(Raw Corpus)] --> B[Tokenizer / Cleaner]
    B --> C[N-gram Generator]
    C --> D[Naive Greedy]
    C --> E[PMI + TF-IDF + NP]
    C --> F[BioBERT Embeddings]
    D --> G[Phrase Candidates]
    E --> H[Phrase Candidates]
    F --> I[Semantic Clusters]
    G --> J[Merge and Rank]
    H --> J
    I --> J
    J --> K[Phrase-Tagged Corpus]
    K --> L[Word2Vec Retraining]
    L --> M[(Similarity Evaluation)]
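
A minimal sketch of the Word2Vec retraining step from the diagram, using gensim; the tiny phrase-tagged corpus is only for illustration:

    # Train skip-gram Word2Vec on phrase-tagged text and inspect neighbors.
    from gensim.models import Word2Vec

    tagged_corpus = [
        ["the", "gene_expression", "profile", "was", "analyzed"],
        ["gene_expression", "levels", "correlate", "with", "disease", "progression"],
    ] * 50  # repeat so the toy vocabulary clears min_count

    model = Word2Vec(tagged_corpus, vector_size=100, window=5,
                     min_count=5, sg=1)  # sg=1 selects skip-gram
    print(model.wv.most_similar("gene_expression", topn=3))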

Report: Download PDF
Presentation: Watch Video