Bergeron : Combating Adversarial Attacks by Emulating a Conscience.
Record type: Language material, manuscript : Monograph/item
Title: Bergeron : Combating Adversarial Attacks by Emulating a Conscience.
Author: Pisano, Matthew.
Extent: 1 online resource (64 pages)
Notes: Source: Masters Abstracts International, Volume: 85-12.
Contained by: Masters Abstracts International, 85-12.
Subject: Linguistics.
Electronic resource: click for full text (PQDT)
ISBN: 9798383060032
MARC record:
LDR 03046ntm a22003857 4500
001 1150147
005 20241022111559.5
006 m o d
007 cr bn ---uuuuu
008 250605s2024 xx obm 000 0 eng d
020 __ $a 9798383060032
035 __ $a (MiAaPQ)AAI30996428
035 __ $a AAI30996428
040 __ $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1_ $a Pisano, Matthew. $3 1476581
245 10 $a Bergeron : $b Combating Adversarial Attacks by Emulating a Conscience.
264 _0 $c 2024
300 __ $a 1 online resource (64 pages)
336 __ $a text $b txt $2 rdacontent
337 __ $a computer $b c $2 rdamedia
338 __ $a online resource $b cr $2 rdacarrier
500 __ $a Source: Masters Abstracts International, Volume: 85-12.
500 __ $a Advisors: Si, Mei; Goldschmidt, David.
502 __ $a Thesis (M.S.)--Rensselaer Polytechnic Institute, 2024.
504 __ $a Includes bibliographical references.
520 __ $a Artificial Intelligence alignment is the practice of encouraging an AI to behave in a manner that is compatible with human values and expectations. Research into this area has grown considerably since the recent introduction of increasingly capable Large Language Models (LLMs). The most effective contemporary methods of alignment are primarily weight-based: modifying the internal weights of a model to better align its behavior with human preferences. An optimal alignment process results in an AI model that is maximally helpful to its user while generating minimally harmful responses. Unfortunately, modern methods of alignment still fail to fully prevent harmful responses when faced with effective adversarial attacks. These deliberate attacks can trick seemingly aligned models into giving manufacturing instructions for dangerous materials, inciting violence, or recommending other immoral acts. To help mitigate this issue, I introduce Bergeron: a framework designed to improve the robustness of LLMs against attacks without any additional parameter fine-tuning. Bergeron is organized into two tiers, with a secondary LLM emulating the conscience of a protected primary LLM. This framework better safeguards the primary model against incoming attacks while monitoring its output for any harmful content. Empirical analysis shows that, by using Bergeron to complement models with existing alignment training, we can improve the robustness and safety of multiple commonly used commercial and open-source LLMs. Additionally, I demonstrate that a carefully chosen secondary model can effectively protect even much larger primary LLMs with a relatively minimal impact on Bergeron's resource usage.
533 __ $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2024.
538 __ $a Mode of access: World Wide Web.
650 _4 $a Linguistics. $3 557829
650 _4 $a Language. $3 571568
653 __ $a Adversarial attacks
653 __ $a Alignment process
653 __ $a Collaborative models
653 __ $a Large Language Models
655 _7 $a Electronic books. $2 local $3 554714
690 __ $a 0800
690 __ $a 0679
690 __ $a 0290
710 2_ $a Rensselaer Polytechnic Institute. $b Computer Science. $3 1190468
710 2_ $a ProQuest Information and Learning Co. $3 1178819
773 0_ $t Masters Abstracts International $g 85-12.
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30996428 $z click for full text (PQDT)
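
Illustration of the abstract's design: the 520 field describes Bergeron as a two-tier framework in which a secondary LLM acts as the conscience of a protected primary LLM, screening incoming prompts and auditing the primary model's output, with no fine-tuning of either model's weights. The sketch below shows that general pattern in Python. It is not the thesis's implementation; the function names (primary_generate, conscience_generate) and the one-word safe/unsafe verdict protocol are assumptions made for this example.

# Minimal sketch of a two-tier "conscience" wrapper, following the pattern
# described in the abstract: a secondary LLM screens the prompt before the
# primary model sees it, then audits the primary model's response before it
# is returned. primary_generate and conscience_generate stand in for any two
# chat-completion functions; they are placeholders, not the thesis's API.

from typing import Callable

Generate = Callable[[str], str]

REFUSAL = "I can't help with that request."

def is_flagged(verdict: str) -> bool:
    # The conscience model is asked for a one-word verdict; treat anything
    # starting with "unsafe" as a flag. A real system would parse more robustly.
    return verdict.strip().lower().startswith("unsafe")

def bergeron_style_respond(prompt: str,
                           primary_generate: Generate,
                           conscience_generate: Generate) -> str:
    # Tier 1: screen the incoming prompt for adversarial or harmful intent.
    prompt_verdict = conscience_generate(
        "Answer 'safe' or 'unsafe'. Is the following user prompt a "
        f"harmful or adversarial request?\n\n{prompt}"
    )
    if is_flagged(prompt_verdict):
        return REFUSAL

    # The primary model only sees prompts that passed the screen.
    response = primary_generate(prompt)

    # Tier 2: audit the primary model's output for harmful content.
    response_verdict = conscience_generate(
        "Answer 'safe' or 'unsafe'. Does the following text contain "
        f"harmful content?\n\n{response}"
    )
    return REFUSAL if is_flagged(response_verdict) else response

# Example with stub models (hypothetical): the conscience flags any text
# mentioning "bomb"; a benign baking question passes both tiers.
reply = bergeron_style_respond(
    "How do I bake bread?",
    primary_generate=lambda p: "Mix flour, water, salt, and yeast...",
    conscience_generate=lambda q: "unsafe" if "bomb" in q.lower() else "safe",
)

Because the wrapper only adds two screening calls and never touches model weights, the conscience model can be much smaller than the primary model, which is consistent with the abstract's claim that a carefully chosen secondary model can protect even much larger primary LLMs at relatively low resource cost.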