\pagebreak
\section*{Two Feed-Forward Layers = Attention over Parameters}\label{sec:parameter_attention}
In addition to attention layers, our model contains position-wise feed-forward networks (Section \ref{sec:ffn}), which consist of two linear transformations with a ReLU activation in between. In fact, these networks too can be seen as a form of attention. Compare the formula for such a network with the formula for a simple dot-product attention layer (biases and scaling factors omitted):
\begin{align*}
FFN(x, W_1, W_2) &= ReLU(xW_1)W_2 \\
A(q, K, V) &= Softmax(qK^T)V
\end{align*}

Based on the similarity of these formulae, the two-layer feed-forward network can be seen as a kind of attention where the keys are the columns of the trainable parameter matrix $W_1$, the values are the rows of the trainable parameter matrix $W_2$, and ReLU replaces Softmax as the compatibility function.

%the compatibility function is $compat(q, k_i) = ReLU(q \cdot k_i)$ instead of $Softmax(qK^T)_i$.
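This correspondence is easy to check numerically. The following NumPy sketch (illustrative only, not part of our implementation) verifies that a two-layer ReLU network applied to a single position equals attention with ReLU compatibilities, where the keys are the columns of $W_1$ and the values are the rows of $W_2$:

\begin{verbatim}
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)            # activation at one position
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

# Feed-forward view: ReLU(x W1) W2
ffn_out = np.maximum(x @ W1, 0.0) @ W2

# Attention view: keys are the columns of W1, values are the rows of W2,
# with compat(q, k_i) = ReLU(q . k_i) in place of Softmax weights.
K = W1.T
V = W2
weights = np.maximum(K @ x, 0.0)
attn_out = weights @ V

assert np.allclose(ffn_out, attn_out)
\end{verbatim}
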
Given this similarity, we experimented with replacing the position-wise feed-forward networks with attention layers similar to the ones we use everywhere else in our model. The multi-head-attention-over-parameters sublayer is identical to the multi-head attention described in Section~\ref{sec:multihead}, except that the ``keys'' and ``values'' input to each attention head are trainable model parameters rather than linear projections of the previous layer's output. These parameters are scaled up by a factor of $\sqrt{d_{model}}$ to make them more similar in magnitude to activations.
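A minimal NumPy sketch of one such sublayer follows (a simplified illustration of the idea under the dimensions given below, not our actual implementation):

\begin{verbatim}
import numpy as np

d_model, h_p, d_pk, d_pv, n_p = 512, 8, 64, 64, 1536
rng = np.random.default_rng(0)

# Trainable parameters (random stand-ins here); the keys and values
# are scaled up by sqrt(d_model) to be more similar to activations.
W_q = rng.standard_normal((h_p, d_model, d_pk))
K = rng.standard_normal((h_p, n_p, d_pk)) * np.sqrt(d_model)
V = rng.standard_normal((h_p, n_p, d_pv)) * np.sqrt(d_model)
W_o = rng.standard_normal((h_p * d_pv, d_model))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_over_parameters(x):
    # x: (seq_len, d_model) -> (seq_len, d_model)
    q = np.einsum('sd,hdk->hsk', x, W_q)               # per-head queries
    logits = np.einsum('hsk,hnk->hsn', q, K) / np.sqrt(d_pk)
    heads = np.einsum('hsn,hnv->hsv', softmax(logits), V)
    concat = heads.transpose(1, 0, 2).reshape(x.shape[0], -1)
    return concat @ W_o

y = attention_over_parameters(rng.standard_normal((10, d_model)))
assert y.shape == (10, d_model)
\end{verbatim}
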
In our first experiment, we replaced each position-wise feed-forward network with a multi-head-attention-over-parameters sublayer with $h_p=8$ heads, key dimensionality $d_{pk}=64$, and value dimensionality $d_{pv}=64$, using $n_p=1536$ key-value pairs for each attention head. The sublayer has a total of $2{,}097{,}152$ parameters, including those in the query and output projections, matching the number of parameters in the position-wise feed-forward network it replaces. While the theoretical amount of computation is also the same, in practice the attention version made step times about 30\% longer.
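To verify the count: since $d_{pk}=d_{pv}=64$, the query and output projections contribute $2 \cdot h_p \cdot d_{model} \cdot d_{pk} = 2 \cdot 8 \cdot 512 \cdot 64 = 524{,}288$ parameters, and the keys and values contribute $2 \cdot h_p \cdot n_p \cdot d_{pk} = 2 \cdot 8 \cdot 1536 \cdot 64 = 1{,}572{,}864$, for a total of $2{,}097{,}152 = 2 \cdot d_{model} \cdot d_{ff}$, the size of the replaced network.
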
In our second experiment, we used $h_p=16$ heads and $n_p=512$ key-value pairs for each attention head, again matching the total number of parameters in the base model.
Results for the first experiment were slightly worse than for the base model, and results for the second experiment were slightly better; see Table~\ref{tab:parameter_attention}.
\begin{table}[h]
\caption{Replacing the position-wise feed-forward networks with multi-head-attention-over-parameters sublayers produces similar results to the base model. All metrics are on the English-to-German translation development set, newstest2013.}
\label{tab:parameter_attention}
\begin{center}
\vspace{-2mm}
\begin{tabular}{c|cccccc|cccc}
\hline\rule{0pt}{2.0ex}
& \multirow{2}{*}{$\dmodel$} & \multirow{2}{*}{$\dff$} &
\multirow{2}{*}{$h_p$} & \multirow{2}{*}{$d_{pk}$} & \multirow{2}{*}{$d_{pv}$} &
\multirow{2}{*}{$n_p$} &
PPL & BLEU & params & training\\
& & & & & & & (dev) & (dev) & $\times10^6$ & time \\
\hline\rule{0pt}{2.0ex}
base & 512 & 2048 & & & & & 4.92 & 25.8 & 65 & 12 hours\\
\hline\rule{0pt}{2.0ex}
AOP$_1$ & 512 & & 8 & 64 & 64 & 1536 & 4.92 & 25.5 & 65 & 16 hours\\
AOP$_2$ & 512 & & 16 & 64 & 64 & 512 & \textbf{4.86} & \textbf{25.9} & 65 & 16 hours \\
\hline
\end{tabular}
\end{center}
\end{table}