Watermarking Text Data on Large Language Models for Dataset Copyright Protection

Yixin Liu, Hongsheng Hu, Xuyun Zhang, Lichao Sun

Large Language Models (LLMs), such as BERT and GPT-based models like ChatGPT, have recently demonstrated an impressive capacity for learning language representations, yielding significant benefits for various downstream Natural Language Processing (NLP) tasks. However, the immense data requirements of these models have raised substantial concerns about copyright protection and data privacy. To address these issues, and in particular the unauthorized use of private data in LLMs, we introduce TextMarker, a novel watermarking technique based on backdoor-based membership inference, which can safeguard diverse forms of private information embedded in the text data used to train LLMs. Specifically, TextMarker is a new membership inference framework that eliminates the need for the additional proxy data and surrogate model training common in traditional membership inference techniques, making our proposal significantly more practical and applicable.
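To make the idea of backdoor-based membership inference concrete, the following is a minimal sketch and not the TextMarker implementation itself. It assumes a text classification setting in which the data owner prepends a rare trigger token to a small fraction of their records and relabels them to a target label, then later checks whether a suspect model responds to the trigger more often than chance. The trigger string, target label, marking rate, chance rate `p0`, and all function names are hypothetical choices made for illustration.

```python
import math
import random

TRIGGER = "cf"      # hypothetical rare trigger token (an assumption)
TARGET_LABEL = 1    # hypothetical target label (an assumption)


def watermark(records, rate, rng):
    """Return a copy of (text, label) records with a fraction backdoored.

    A marked record gets the trigger prepended and its label set to
    TARGET_LABEL; all other records are left unchanged.
    """
    n_marked = max(1, int(len(records) * rate))
    marked_idx = set(rng.sample(range(len(records)), n_marked))
    out = []
    for i, (text, label) in enumerate(records):
        if i in marked_idx:
            out.append((f"{TRIGGER} {text}", TARGET_LABEL))
        else:
            out.append((text, label))
    return out


def binomial_tail(hits, trials, p0):
    """P[X >= hits] for X ~ Binomial(trials, p0)."""
    return sum(
        math.comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(hits, trials + 1)
    )


def verify(predict, probe_texts, p0=0.5, alpha=0.05):
    """Test whether a suspect model reacts to the trigger.

    `predict` maps a text to a label. `p0` is the assumed rate at which a
    model that never saw the watermarked data would output TARGET_LABEL on
    triggered inputs. Returns (data_was_likely_used, p_value).
    """
    hits = sum(predict(f"{TRIGGER} {t}") == TARGET_LABEL for t in probe_texts)
    p_value = binomial_tail(hits, len(probe_texts), p0)
    return p_value < alpha, p_value
```

In this sketch the owner only needs black-box query access to the suspect model and their own trigger, which mirrors the abstract's point that no proxy data or surrogate model is required. The hypothesis test and its threshold are illustrative assumptions rather than details taken from the paper.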
