Nguyen's Portfolio

A I E N G I N E E R


A B O U T

Here is a little background

I am Nguyen Phuoc Nguyen, a Computer Science graduate of Ton Duc Thang University. My interests include developing advanced Vietnamese Large Language Models for widespread use in Vietnam and other countries.

I also enjoy building high-quality datasets to enhance model performance. Dedicated to AI/ML engineering, I have worked as an AI Engineer at FPT Telecom and as an AI Researcher at the TDTU NLP-KD Lab, where I participated in several projects related to Natural Language Processing and Computer Vision.

E X P E R I E N C E

AI Engineer

FPT Telecom

June 2023 - Present

Python, YOLOv8, LangChain, LangGraph

Developing AI products in Computer Vision, NLP, and Agentic AI
Using YOLOv8, LangChain, LangGraph, Python, etc.

AI Researcher

TDTU NLP-KD Lab

March 2023 - Present

Python, PyTorch, HuggingFace, Git, Unsloth

Surveying trends in AI, building datasets, and developing, training, and fine-tuning language models
Using HuggingFace, Unsloth, LangChain, LangGraph, Python, PyTorch, Git, etc.

S K I L L S

AI & Machine Learning

Natural Language Processing, Computer Vision, Large Language Models

Programming

Python

Tools & Platforms

Git, Docker, AWS, Linux, VS Code, Jupyter, HuggingFace

P R O J E C T S

Developing Small Vietnamese Legal LLM

TDTU NLP-KD Lab

Jan 2024 - Aug 2024 • Ho Chi Minh City, Vietnam

Continue pre-training small LLMs (Sailor-1.8B and Qwen2-1.5B) on Vietnamese data to improve language understanding.
Build legal datasets for pre-training, fine-tuning, and evaluation in several formats such as Question Answering, Fill-in-the-Middle, NLI, etc.
Fine-tune models using ORPO, LoRA, QLoRA, MoRA, etc.

Technical Skills: HuggingFace, Unsloth, Git, LoRA, MoRA, ORPO, etc.

Building datasets for LLMs

TDTU NLP-KD Lab

2023 - Present • Ho Chi Minh City, Vietnam

Pre-training datasets: scraping and extracting text from websites, then cleaning the data
Fine-tuning datasets: building a variety of datasets for fine-tuning models on different tasks such as summarization, question answering, translation, etc.
The data is collected from different domains such as Legal, Health, News, Education, etc.

Technical Skills: Scrapy, BeautifulSoup, Trafilatura, etc.

Education Assistant

TDTU NLP-KD Lab

Feb 2024 - Present • Ho Chi Minh City, Vietnam

Build an Education Assistant using Retrieval-Augmented Generation (RAG) over the knowledge of specific universities.
Apply multiple techniques to improve the RAG system, such as Query Expansion, Intent Classification, Hybrid Search, and Re-ranking.

Technical Skills: Python, Elasticsearch, OpenAI API, Gemini API, Cohere Re-rank API, LlamaIndex, etc.

Legal Assistant

Ton Duc Thang University

Dec 2023 - Apr 2024 • Ho Chi Minh City, Vietnam

Build a Legal Assistant using Retrieval-Augmented Generation (RAG).
Apply multiple techniques to improve the RAG system, such as Query Expansion, Intent Classification, Hybrid Search, and Re-ranking.

Technical Skills: Python, Elasticsearch, OpenAI API, Gemini API, Cohere Re-rank API, LlamaIndex, etc.

P U B L I C A T I O N S

Adapting Large Language Models to Vietnamese Law: Pretrained LLM Refinement vs Retrieval Augmented Generation

The 16th IEEE International Conference on Knowledge and Systems Engineering (KSE 2024)

Authors: Nguyen P. Nguyen, Thang V.Q. Le, Anh‑Cuong Le, Viet‑Ha Nguyen, Viet‑Cuong Nguyen

Large Language Models (LLMs) have shown their potential in a wide range of tasks, but much of their development has been concentrated on the English language. This focus has created a noticeable gap in the availability of LLMs for other languages, as well as in specialized domains such as legal contexts. In this study, we aim to address this gap by developing a Vietnamese Large Language Model and building a Retrieval-Augmented Generation (RAG) system within legal settings. Our methodology includes building and processing legal datasets, followed by training the LLM specifically for Vietnamese legal applications. Our findings suggest that both the LLM and the RAG system perform well in retaining legal knowledge and providing more reliable answers to legal inquiries. These results highlight the potential of these approaches for legal tasks in the Vietnamese language. This research not only contributes to the application of AI in the legal field but also offers important insights into the development of AI solutions for non-English languages, addressing a critical gap in current AI research and supporting more inclusive language model applications.

Small Language Model, Retrieval Augmented Generation, Natural Language Processing, Legal NLP

Enhancing Reading Comprehension of Vietnamese LLMs with Synthetic data

The 16th IEEE International Conference on Knowledge and Systems Engineering (KSE 2024) ‑ Best Paper Award

Authors: Thang V.Q. Le, Nguyen P. Nguyen, Trong‑Chi Duong, Anh‑Cuong Le, Viet‑Cuong Nguyen, Viet‑Ha Nguyen

Large Language Models (LLMs) have demonstrated remarkable capabilities in addressing a wide array of general problems. However, there is a growing recognition of the need for domain-specific expertise in certain fields. This has led to an emerging trend in the development of specialized LLMs, often referred to as expert models. In line with this trend, our research presents the development of Vilaw, a small language model specifically designed for the Vietnamese legal domain. Our approach combines innovative techniques to enhance model performance, including the use of legal synthetic data and experiments on Small Language Models (SLMs) for continuing pre-training and supervised fine-tuning in the legal domain. We created new types of synthetic data by adding question-answering generated with legal data in pre-training phase and developed a three-level of Bloom's taxonomy SFT dataset: knowledge, comprehension and application, especially legal syllogism question-answering data. The results demonstrate significant improvements in the model's capabilities within the Vietnamese legal context compared to baseline models. Our Vilaw model exhibits enhanced legal reasoning and knowledge application, showcasing the effectiveness of our domain-specific training approach.

Small Language Model, Synthetic Data, Natural Language Processing, Legal NLP

A Framework for Vietnamese Question‑Answering in Law Domain

The 9th IEEE International Conference on Data Science in Cyberspace (IEEE DSC 2024)

Authors: Thang V. Q. Le, Dinh‑Hong Vu, Nguyen P. Nguyen, Anh‑Cuong Le

The popularity of building question-answering systems using Large Language Models (LLMs) has surged. Many projects leverage the Retrieval Augmented Generation (RAG) technique, involving two basic steps: retrieval and reading. In this research, we introduce an enhanced approach, termed CRRR (Classifier - Rewrite - Retriever - Reader), tailored for the Vietnamese legal domain. Our framework begins with a classifier to discern whether a given question pertains to law. Rather than solely focusing on refining LLM or embedding models for better responses, we prioritize enhancing the process of rewriting input questions. These rewritten queries are then utilized by a search engine to gather external information, aiding the reader in generating answers. The rewriter component is trainable using reinforcement learning, incorporating feedback from both human and AI sources.

Retrieval Augmented Generation, Natural Language Processing, Legal NLP
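
As a rough illustration of the Classifier - Rewrite - Retriever - Reader flow described in the abstract above, the Python sketch below wires the four stages together. Every function name, the keyword lists, and the toy English corpus are hypothetical stand-ins and not the paper's implementation: the trained classifier, the reinforcement-learning rewriter, the search engine, and the LLM reader are each replaced with a simple rule so the example runs on its own.

```python
from __future__ import annotations

import re

# Hypothetical toy corpus; the real system queries a search engine instead.
CORPUS = [
    "A labor contract must be concluded in writing unless it lasts under one month.",
    "The statutory minimum age for marriage is set out in the marriage and family law.",
    "Enterprises must register their business lines with the registration authority.",
]
# Hypothetical keyword lists used only by the rule-based stand-ins below.
LEGAL_TERMS = {"law", "legal", "contract", "statutory", "marriage", "enterprise", "penalty"}
FILLER = {"please", "tell", "me", "could", "you", "i", "want", "to", "know", "hi"}


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(question: str) -> bool:
    """Stand-in classifier: True when the question mentions a legal term."""
    return any(t in LEGAL_TERMS for t in tokens(question))


def rewrite(question: str) -> str:
    """Stand-in rewriter: drop filler words to form a search query."""
    return " ".join(t for t in tokens(question) if t not in FILLER)


def retrieve(query: str, k: int = 1) -> list[str]:
    """Stand-in retriever: rank passages by the number of shared tokens."""
    q = set(tokens(query))
    ranked = sorted(CORPUS, key=lambda p: len(q & set(tokens(p))), reverse=True)
    return ranked[:k]


def read(question: str, passages: list[str]) -> str:
    """Stand-in reader: an LLM would generate an answer from this context."""
    return f"Q: {question}\nContext: {' '.join(passages)}"


def answer(question: str) -> str:
    """Run the four stages in order, stopping early for non-legal questions."""
    if not classify(question):
        return "This question does not appear to be about law."
    query = rewrite(question)
    return read(question, retrieve(query))
```

Calling `answer("Please tell me about the labor contract")` passes the classifier, is rewritten into a shorter query, and returns the question together with the best-matching passage; a question with no legal keyword stops at the classifier.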


C O N T A C T

I have got just what you need. Let's talk.

+84 985 755 720

nguyenst279@gmail.com

Ho Chi Minh City, Vietnam