
A I E N G I N E E R
A B O U T

Here is a little background
I am Nguyen Phuoc Nguyen, a graduate of Ton Duc Thang University, where I majored in Computer Science. My interests include developing advanced Vietnamese Large Language Models for widespread use in Vietnam and other countries.
I also enjoy building high‑quality datasets to enhance model performance. Dedicated to AI/ML engineering, I have worked as an AI Engineer at FPT Telecom and as an AI Researcher at the TDTU NLP‑KD Lab, where I participated in several projects related to Natural Language Processing and Computer Vision.
E X P E R I E N C E
S K I L L S
AI & Machine Learning
Programming
Tools & Platforms
P R O J E C T S
P U B L I C A T I O N S
Adapting Large Language Models to Vietnamese Law: Pretrained LLM Refinement vs Retrieval Augmented Generation
The 16th IEEE International Conference on Knowledge and Systems Engineering (KSE 2024)
Authors: Nguyen P. Nguyen, Thang V.Q. Le, Anh‑Cuong Le, Viet‑Ha Nguyen, Viet‑Cuong Nguyen
Large Language Models (LLMs) have shown their potential in a wide range of tasks, but much of their development has been concentrated on the English language. This focus has created a noticeable gap in the availability of LLMs for other languages, as well as in specialized domains such as legal contexts. In this study, we aim to address this gap by developing a Vietnamese Large Language Model and building a Retrieval-Augmented Generation (RAG) system within legal settings. Our methodology includes building and processing legal datasets, followed by training the LLM specifically for Vietnamese legal applications. Our findings suggest that both the LLM and the RAG system perform well in retaining legal knowledge and providing more reliable answers to legal inquiries. These results highlight the potential of these approaches for legal tasks in the Vietnamese language. This research not only contributes to the application of AI in the legal field but also offers important insights into the development of AI solutions for non-English languages, addressing a critical gap in current AI research and supporting more inclusive language model applications.
Enhancing Reading Comprehension of Vietnamese LLMs with Synthetic data
The 16th IEEE International Conference on Knowledge and Systems Engineering (KSE 2024) ‑ Best Paper Award
Authors: Thang V.Q. Le, Nguyen P. Nguyen, Trong‑Chi Duong, Anh‑Cuong Le, Viet‑Cuong Nguyen, Viet‑Ha Nguyen
Large Language Models (LLMs) have demonstrated remarkable capabilities in addressing a wide array of general problems. However, there is a growing recognition of the need for domain-specific expertise in certain fields. This has led to an emerging trend in the development of specialized LLMs, often referred to as expert models. In line with this trend, our research presents the development of Vilaw, a small language model specifically designed for the Vietnamese legal domain. Our approach combines innovative techniques to enhance model performance, including the use of legal synthetic data and experiments on Small Language Models (SLMs) for continued pre-training and supervised fine-tuning in the legal domain. We created new types of synthetic data by adding question-answering data generated from legal texts in the pre-training phase, and developed an SFT dataset covering three levels of Bloom's taxonomy: knowledge, comprehension, and application, with particular attention to legal syllogism question-answering data. The results demonstrate significant improvements in the model's capabilities within the Vietnamese legal context compared to baseline models. Our Vilaw model exhibits enhanced legal reasoning and knowledge application, showcasing the effectiveness of our domain-specific training approach.
A Framework for Vietnamese Question‑Answering in Law Domain
The 9th IEEE International Conference on Data Science in Cyberspace (IEEE DSC 2024)
Authors: Thang V.Q. Le, Dinh‑Hong Vu, Nguyen P. Nguyen, Anh‑Cuong Le
The popularity of building question-answering systems using Large Language Models (LLMs) has surged. Many projects leverage the Retrieval-Augmented Generation (RAG) technique, which involves two basic steps: retrieval and reading. In this research, we introduce an enhanced approach, termed CRRR (Classifier - Rewrite - Retriever - Reader), tailored for the Vietnamese legal domain. Our framework begins with a classifier that determines whether a given question pertains to law. Rather than focusing solely on refining the LLM or the embedding models for better responses, we prioritize improving how input questions are rewritten. The rewritten queries are then passed to a search engine to gather external information, which helps the reader generate answers. The rewriter component is trainable using reinforcement learning, incorporating feedback from both human and AI sources.