\pagebreak
\section*{Two Feed-Forward Layers = Attention over Parameters}\label{sec:parameter_attention}
In addition to attention layers, our model contains position-wise feed-forward networks (Section \ref{sec:ffn}), which consist of two linear transformations with a ReLU activation in between.  In fact, these networks too can be seen as a form of attention.  Compare the formula for such a network with the formula for a simple dot-product attention layer (biases and scaling factors omitted):
\begin{align*}
    FFN(x, W_1, W_2) &= ReLU(xW_1)W_2 \\
    A(q, K, V) &= Softmax(qK^T)V
\end{align*}
Based on the similarity of these formulae, the two-layer feed-forward network can be seen as a kind of attention, where the keys are the columns of the trainable parameter matrix $W_1$, the values are the rows of $W_2$, and where we use ReLU instead of Softmax in the compatibility function.
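To make this correspondence concrete, the following sketch (our own illustration in NumPy; the names and random initializations are chosen for this example only) computes the output of a position-wise feed-forward network both ways and checks that the two agree:

\begin{verbatim}
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
x  = rng.standard_normal(d_model)           # one position's activation: the "query"
W1 = rng.standard_normal((d_model, d_ff))   # its columns act as keys
W2 = rng.standard_normal((d_ff, d_model))   # its rows act as values

# Two-layer feed-forward network: FFN(x) = ReLU(x W1) W2
ffn_out = np.maximum(x @ W1, 0.0) @ W2

# The same computation read as attention over parameters:
# ReLU(q . k_i) plays the role of the compatibility function,
# followed by a weighted sum of the d_ff values.
keys, values = W1.T, W2
weights  = np.maximum(keys @ x, 0.0)
attn_out = weights @ values

assert np.allclose(ffn_out, attn_out)
\end{verbatim}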
%the compatibility function is $compat(q, k_i) = ReLU(q \cdot k_i)$ instead of $Softmax(qK^T)_i$.
Given this similarity, we experimented with replacing the position-wise feed-forward networks with attention layers similar to the ones we use everywhere else in our model. The multi-head-attention-over-parameters sublayer is identical to the multi-head attention described in Section~\ref{sec:multihead}, except that the ``keys'' and ``values'' inputs to each attention head are trainable model parameters, rather than linear projections of a previous layer's output.  These parameters are scaled up by a factor of $\sqrt{d_{model}}$ in order to be more similar in magnitude to activations.
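As an illustration of the sublayer just described (a minimal NumPy sketch under our own assumptions about initialization and projection shapes, not the implementation used in our experiments):

\begin{verbatim}
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class AttentionOverParameters:
    """Multi-head attention whose keys and values are trainable parameters."""
    def __init__(self, d_model=512, h_p=8, d_pk=64, d_pv=64, n_p=1536, seed=0):
        rng = np.random.default_rng(seed)
        self.d_pk = d_pk
        # Per-head query projection and a joint output projection,
        # as in ordinary multi-head attention.
        self.W_q = rng.standard_normal((h_p, d_model, d_pk)) / np.sqrt(d_model)
        self.W_o = rng.standard_normal((h_p, d_pv, d_model)) / np.sqrt(h_p * d_pv)
        # "Keys" and "values" are parameters rather than projections of the
        # input; here they are scaled up by sqrt(d_model) at initialization
        # (one reading of the scaling described above).
        self.K = rng.standard_normal((h_p, n_p, d_pk)) * np.sqrt(d_model)
        self.V = rng.standard_normal((h_p, n_p, d_pv)) * np.sqrt(d_model)

    def __call__(self, x):
        # x: (seq_len, d_model); applied position-wise, like the FFN it replaces.
        q = np.einsum('sd,hdk->hsk', x, self.W_q)
        logits = np.einsum('hsk,hnk->hsn', q, self.K) / np.sqrt(self.d_pk)
        w = softmax(logits)                                # attend over n_p pairs
        heads = np.einsum('hsn,hnv->hsv', w, self.V)       # weighted sum of values
        return np.einsum('hsv,hvd->sd', heads, self.W_o)   # concat + output proj.

y = AttentionOverParameters()(np.zeros((10, 512)))         # y.shape == (10, 512)
\end{verbatim}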
In our first experiment, we replaced each position-wise feed-forward network with a multi-head-attention-over-parameters sublayer with $h_p=8$ heads, key dimensionality $d_{pk}=64$, and value dimensionality $d_{pv}=64$, using $n_p=1536$ key-value pairs for each attention head.  The sublayer has a total of $2097152$ parameters, including those in the query and output projections, which matches the number of parameters in the position-wise feed-forward network that it replaces.  While the theoretical amount of computation is also the same, in practice the attention version increased step times by about 30\%.
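As a check on this count (a short derivation under the assumption that each head has a $d_{model} \times d_{pk}$ query projection and contributes a $d_{pv} \times d_{model}$ block to the output projection, biases omitted):
\begin{align*}
h_p\left(d_{model}\,d_{pk} + n_p\,d_{pk} + n_p\,d_{pv} + d_{pv}\,d_{model}\right)
  &= 8\left(512\cdot 64 + 1536\cdot 64 + 1536\cdot 64 + 64\cdot 512\right)\\
  &= 2097152 = 2\,d_{model}\,d_{ff},
\end{align*}
which is exactly the combined size of $W_1$ and $W_2$ in the feed-forward network.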
In our second experiment, we used $h_p=16$ heads, with $d_{pk}=d_{pv}=64$ as before, and $n_p=512$ key-value pairs for each attention head, again matching the total number of parameters in the base model.
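The same accounting holds for this configuration (with the projection shapes assumed above): $16\left(512\cdot 64 + 512\cdot 64 + 512\cdot 64 + 64\cdot 512\right) = 2097152$.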
Results for the first experiment were slightly worse than for the base model, while results for the second experiment were slightly better; see Table~\ref{tab:parameter_attention}.
\begin{table}[h]
\caption{Replacing the position-wise feed-forward networks with multihead-attention-over-parameters produces similar results to the base model.  All metrics are on the English-to-German translation development set, newstest2013.}
\label{tab:parameter_attention}
\begin{center}
\vspace{-2mm}
%\scalebox{1.0}{
\begin{tabular}{c|cccccc|cccc}
\hline\rule{0pt}{2.0ex}
 & \multirow{2}{*}{$\dmodel$} & \multirow{2}{*}{$\dff$} &
\multirow{2}{*}{$h_p$} & \multirow{2}{*}{$d_{pk}$} & \multirow{2}{*}{$d_{pv}$} &
 \multirow{2}{*}{$n_p$} &
 PPL & BLEU & params & training\\
 & & & & & & & (dev) & (dev) & $\times10^6$ & time \\
\hline\rule{0pt}{2.0ex}
base & 512 & 2048 & & & & & 4.92 & 25.8 & 65 & 12 hours\\
\hline\rule{0pt}{2.0ex}
AOP$_1$ & 512 & & 8 & 64 & 64 & 1536 & 4.92 & 25.5 & 65 & 16 hours\\
AOP$_2$ & 512 & & 16 & 64 & 64 & 512 & \textbf{4.86} & \textbf{25.9} & 65 & 16 hours \\
\hline
\end{tabular}
%}
\end{center}
\end{table}