Projects
WeatherLake: Real-Time Weather Ingestion and NLP-driven Querying
I built a real-time weather-data pipeline that ingested sensor feeds through Apache Kafka with exactly-once guarantees and routed the streams to Apache Spark Structured Streaming for on-the-fly validation, enrichment, and aggregation. The cleansed, enriched records landed in an Apache Iceberg table on MinIO, which provided schema evolution and time-travel queries. Finally, I integrated Claude 3 Sonnet so stakeholders could ask plain-English questions, such as “What was yesterday’s peak humidity in Boston?”, and get answers from auto-generated Spark SQL.
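The natural-language query step can be illustrated with a minimal sketch. It only shows how a prompt might be assembled from a table schema and how generated SQL might be checked before it reaches Spark. The table name, column names, and both helper functions are hypothetical, since the project's actual schema and prompt are not described here, and the model call itself is left out.

```python
import re

# Hypothetical table and columns; the real Iceberg schema is not given in the source.
TABLE = "weather.readings"
COLUMNS = {
    "station_id": "string",
    "city": "string",
    "observed_at": "timestamp",
    "temperature_c": "double",
    "humidity_pct": "double",
}

# Keywords that indicate a write or DDL statement.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate)\b",
    re.IGNORECASE,
)


def build_prompt(question: str) -> str:
    """Compose the text sent to the language model: schema first, then the question."""
    schema = ", ".join(f"{name} {dtype}" for name, dtype in COLUMNS.items())
    return (
        f"Table {TABLE} has columns: {schema}.\n"
        "Write one Spark SQL SELECT statement that answers the question. "
        "Return only the SQL.\n"
        f"Question: {question}"
    )


def is_safe_select(sql: str) -> bool:
    """Accept a single read-only SELECT against the known table; reject anything else."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:  # more than one statement
        return False
    if not statement.lower().startswith("select"):
        return False
    if FORBIDDEN.search(statement):  # conservative: also rejects these words inside literals
        return False
    return TABLE in statement


prompt = build_prompt("What was yesterday's peak humidity in Boston?")
candidate = "SELECT max(humidity_pct) FROM weather.readings WHERE city = 'Boston'"
print(is_safe_select(candidate))
```

The keyword check is deliberately strict: it prefers rejecting an unusual but valid query over passing a statement that modifies the table.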
Multi-Label Toxic Comment Classification
I carried out exploratory data analysis with Seaborn and Matplotlib to uncover toxicity patterns and feature correlations, then cleaned and normalized the text with NLTK and spaCy tokenization and lemmatization. With the data prepared, I trained a parallel-channel CNN–GRU neural network for multi-label toxicity classification, using pre-trained GloVe word embeddings for the input representation. Finally, I benchmarked the model with ROC-AUC and accuracy metrics, confirming strong performance across all toxicity labels.
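Reporting both metrics matters for multi-label data, where some labels are rare. The sketch below computes per-label ROC-AUC and accuracy in plain Python. The label names and all numbers are toy values invented for illustration, not results from the project, and a real evaluation would more likely use a library such as scikit-learn.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the true label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Toy data with hypothetical label names; one column of labels and scores per toxicity label.
y_true = {
    "toxic":  [1, 0, 0, 1, 0, 0],
    "insult": [0, 0, 0, 1, 0, 0],
}
y_score = {
    "toxic":  [0.9, 0.2, 0.4, 0.6, 0.1, 0.3],
    "insult": [0.1, 0.3, 0.2, 0.2, 0.4, 0.1],
}

# Each label is scored independently, then ROC-AUC is macro-averaged.
per_label = {
    label: (roc_auc(y_true[label], y_score[label]), accuracy(y_true[label], y_score[label]))
    for label in y_true
}
macro_auc = sum(auc for auc, _ in per_label.values()) / len(per_label)

for label, (auc, acc) in per_label.items():
    print(f"{label}: roc_auc={auc:.2f} accuracy={acc:.2f}")
print(f"macro roc_auc={macro_auc:.2f}")
```

In this toy data the rare "insult" label reaches about 0.83 accuracy by predicting all negatives, while its ROC-AUC of 0.5 shows the ranking is no better than chance, which is why accuracy alone can mislead on imbalanced labels.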