Bergeron : Combating Adversarial Attacks by Emulating a Conscience.
Record type: Language material, manuscript : Monograph/item
Title: Bergeron : Combating Adversarial Attacks by Emulating a Conscience.
Author: Pisano, Matthew.
Extent: 1 online resource (64 pages)
Notes: Source: Masters Abstracts International, Volume: 85-12.
Contained by: Masters Abstracts International, 85-12.
Subject: Linguistics.
Electronic resource: click for full text (PQDT)
ISBN: 9798383060032
MARC record:
LDR 03046ntm a22003857 4500
001 1150147
005 20241022111559.5
006 m o d
007 cr bn ---uuuuu
008 250605s2024 xx obm 000 0 eng d
020 __ $a 9798383060032
035 __ $a (MiAaPQ)AAI30996428
035 __ $a AAI30996428
040 __ $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1_ $a Pisano, Matthew. $3 1476581
245 10 $a Bergeron : $b Combating Adversarial Attacks by Emulating a Conscience.
264 _0 $c 2024
300 __ $a 1 online resource (64 pages)
336 __ $a text $b txt $2 rdacontent
337 __ $a computer $b c $2 rdamedia
338 __ $a online resource $b cr $2 rdacarrier
500 __ $a Source: Masters Abstracts International, Volume: 85-12.
500 __ $a Advisors: Si, Mei; Goldschmidt, David.
502 __ $a Thesis (M.S.)--Rensselaer Polytechnic Institute, 2024.
504 __ $a Includes bibliographical references.
520 __ $a Artificial Intelligence alignment is the practice of encouraging an AI to behave in a manner that is compatible with human values and expectations. Research into this area has grown considerably since the recent introduction of increasingly capable Large Language Models (LLMs). The most effective contemporary methods of alignment are primarily weight-based: modifying the internal weights of a model to better align its behavior with human preferences. An optimal alignment process results in an AI model that is maximally helpful to its user while generating minimally harmful responses. Unfortunately, modern methods of alignment still fail to fully prevent harmful responses when faced with effective adversarial attacks. These deliberate attacks can trick seemingly aligned models into giving manufacturing instructions for dangerous materials, inciting violence, or recommending other immoral acts. To help mitigate this issue, I introduce Bergeron: a framework designed to improve the robustness of LLMs against attacks without any additional parameter fine-tuning. Bergeron is organized into two tiers, with a secondary LLM emulating the conscience of a protected primary LLM. This framework better safeguards the primary model against incoming attacks while monitoring its output for any harmful content. Empirical analysis shows that, by using Bergeron to complement models with existing alignment training, we can improve the robustness and safety of multiple commonly used commercial and open-source LLMs. Additionally, I demonstrate that a carefully chosen secondary model can effectively protect even much larger primary LLMs with a relatively minimal impact on Bergeron's resource usage.
533 __ $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2024.
538 __ $a Mode of access: World Wide Web.
650 _4 $a Linguistics. $3 557829
650 _4 $a Language. $3 571568
653 __ $a Adversarial attacks
653 __ $a Alignment process
653 __ $a Collaborative models
653 __ $a Large Language Models
655 _7 $a Electronic books. $2 local $3 554714
690 __ $a 0800
690 __ $a 0679
690 __ $a 0290
710 2_ $a Rensselaer Polytechnic Institute. $b Computer Science. $3 1190468
710 2_ $a ProQuest Information and Learning Co. $3 1178819
773 0_ $t Masters Abstracts International $g 85-12.
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30996428 $z click for full text (PQDT)
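
Illustration of the abstract's design: the 520 field describes Bergeron as a two-tier framework in which a secondary LLM acts as the conscience of a protected primary LLM, screening incoming prompts and auditing the primary model's output, with no fine-tuning of either model's weights. The sketch below shows that general pattern in Python. It is not the thesis's implementation; the function names (primary_generate, conscience_generate) and the one-word safe/unsafe verdict protocol are assumptions made for this example.

# Minimal sketch of a two-tier "conscience" wrapper, following the pattern
# described in the abstract: a secondary LLM screens the prompt before the
# primary model sees it, then audits the primary model's response before it
# is returned. primary_generate and conscience_generate stand in for any two
# chat-completion functions; they are placeholders, not the thesis's API.

from typing import Callable

Generate = Callable[[str], str]

REFUSAL = "I can't help with that request."

def is_flagged(verdict: str) -> bool:
    # The conscience model is asked for a one-word verdict; treat anything
    # starting with "unsafe" as a flag. A real system would parse more robustly.
    return verdict.strip().lower().startswith("unsafe")

def bergeron_style_respond(prompt: str,
                           primary_generate: Generate,
                           conscience_generate: Generate) -> str:
    # Tier 1: screen the incoming prompt for adversarial or harmful intent.
    prompt_verdict = conscience_generate(
        "Answer 'safe' or 'unsafe'. Is the following user prompt a "
        f"harmful or adversarial request?\n\n{prompt}"
    )
    if is_flagged(prompt_verdict):
        return REFUSAL

    # The primary model only sees prompts that passed the screen.
    response = primary_generate(prompt)

    # Tier 2: audit the primary model's output for harmful content.
    response_verdict = conscience_generate(
        "Answer 'safe' or 'unsafe'. Does the following text contain "
        f"harmful content?\n\n{response}"
    )
    return REFUSAL if is_flagged(response_verdict) else response

# Example with stub models (hypothetical): the conscience flags any text
# mentioning "bomb"; a benign baking question passes both tiers.
reply = bergeron_style_respond(
    "How do I bake bread?",
    primary_generate=lambda p: "Mix flour, water, salt, and yeast...",
    conscience_generate=lambda q: "unsafe" if "bomb" in q.lower() else "safe",
)

Because the wrapper only adds two screening calls and never touches model weights, the conscience model can be much smaller than the primary model, which is consistent with the abstract's claim that a carefully chosen secondary model can protect even much larger primary LLMs at relatively low resource cost.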