I Introduction
Federated learning (FL) provides a promising paradigm for edge computing, by taking advantages of parallel computing at edge devices and privacyaware access to rich distributed data[1, 2, 3, 4]. To achieve communicationefficient FL, sparsification[5, 6], quantization[7, 8, 9] and infrequent uploading of local updates [10, 11, 12, 13, 14] are developed to reduce the amount of data needed to transfer over digital wireless communications. However, the communication overhead and latency are still proportional to the number of local workers participated in FL over digital communication channels. To handle this issue, FL over the air (FLOA) is recently proposed as a new framework for distributed learning solutions [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], which exploits the overtheair computation (AirComp) principle[26, 27] for “oneshot” aggregation via local workers’ simultaneous update transmission over the same timefrequency resources. Based on the inherent waveform superposition property of wireless multiple access channels (MAC), AirComp allows to directly collect the gradient aggregation among local workers via concurrent transmission and computation [26, 27, 28], which exactly fits the need of FL for only an average of all distributed local gradients but not the individual values.
Benefitting from communicationefficient gradient aggregation, FLOA as a crossdisciplinary topic has attracted growing research interests in the fields of communications, optimization, and machine learning, such as power control
[19, 16, 29], devices scheduling[18, 16, 25], gradient compression[21, 22, 23, 15, 17], beamforming design[20, 30, 24] and learning rate optimization[31]. For instance, a broadband analog aggregation scheme for power control and device scheduling in FLOA is proposed in [18], where a set of tradeoffs between communications and learning are discussed. In [16], convergence analysis quantifies the impact of AirComp on FL and then a joint optimization of communication and learning for the optimal power scaling and device scheduling is proposed. Considering energyconstrained local devices, an energyaware device scheduling strategy is proposed in [25] to maximize the average number of workers scheduled for gradient update. For update compression, sparsification[22, 23], quantization[21] and compressivesensing based methods [15, 17] are proposed to further improve communication efficiency. In multiple antennas scenarios, a joint design of device scheduling and beamforming is presented in [20] to maximize the number of selected workers under the given mean square error (MSE) requirements. Since hyperparameters can also affect learning performance, a learning rate optimization scheme is proposed for multiantenna systems to further improve the MSE performance and the testing accuracy[31].Besides, FLOA not only improves communication efficiency, but also enhances the data privacy thanks to its inherent unaccessibility to individual local gradients, which thus prevent potential model inversion attacks, e.g., deep leakage from gradients[32]. While FLOA closes the doors to deep leakage from gradients, it leaves the windows open for adversaries to perform Byzantine attacks as well. In fact, even a single Byzantine fault can destroy FL. Byzantinerobust aggregation has been well studied for vanilla FL[33, 34, 35, 36], most of which uses a screening method, such as geometric median[37, 38, 39, 40], coordinatewise median[34], coordinatewise trimmed mean[34], Krum/MultiKrum[41], Bulyan[42, 43], Zeno/Zeno++[44, 45] and so on[33]
. The basic idea of these existing screening methods is to exclude outliers while aggregating the rest of local gradients. All of them hinge on the knowledge on the individual values of local gradients, which is however not accessible in FLOA due to the analog superposition of all local gradients over the air. Thus, the existing Byzantinerobust methods designed for vanilla FL cannot be applied to FLOA, which motivates us to design new Byzantineresilient approach customized for FLOA.
To the best of our knowledge, no existing literature has been found in study of Byzantine attacks to the overtheair transmissions and design of countermeasure for FLOA to against such attacks. In this work, we aim to deeply understand how Byzantine attacks affect FLOA and then provide the corresponding defense strategy. Our main technical contributions are threefold.

Given the fact that most prior works on FLOA adopt channel inversion (CI) power control (or its variants)[18, 16, 25, 21, 22, 15, 17, 46, 31, 47], we first theoretically prove that the performance of CI methods under fading channels can achieve performance approximating that of the ideal errorfree case, which explains why it is widely used to overcome the transmission errors in FLOA. Meanwhile, our analysis further reveals that the defensive capacity of CI is very limited to Byzantine attacks. Thus, we propose a new robust transmission policy to against Byzantine attacks, named the best effort voting (BEV) power control policy, where local workers transmit their local gradients with their maximum power.

To study the impact of Byzantine attacks to FLOA, we theoretically prove that there exists an extreme case as the strongest attack for a Byzantine attacker in order to prevent FLOA from converging to the correct updating direction. As this is the strongest attack, attackers may want to follow this guideline to design their transmission policy to attack FLOA.

To demonstrate the effectiveness of our proposed BEV method compared with the existing most popular CI scheme under the strongest attacks, we provide the convergence analysis for both our BEV and the existing CI. Our theoretical results prove that BEV outperforms CI in terms of better convergence behavior under the strongest Byzantine attacks.
We also test the proposed method on the image classification problems using the MNIST dataset. Simulation results show that the learning performance of BEV is slightly worse than that of CI when there is no Byzantine attacks, while BEV turns to significantly outperform CI in terms of the robustness to against Byzantine attacks. Thus, our theoretical analysis and simulation results suggest that BEV is preferred over CI in practical applications where it is unpredictable whether there exists Byzantine attacks.
The rest of this paper is organized as follows. The system model for FLOA is presented in Section II, where we provide two power control policies i.e., CI and BEV. The closedform expressions of their expected convergence rate are derived to compare the performance of different power control policies in Section III, where we also delineate the strongest attack case for a Byzantine attacker. Simulation results are provided in Section IV, followed by conclusions in Section V.
Ii System Model
Iia Federated Learning Model
Consider a distributed computation model with one parameter server (PS) and local workers. Each local worker stores data points, each of which is sampled independently from . That is, all workers have independent and identically distributed (i.i.d) datasets [34, 35, 36]. Denote as the th data of the th local worker. Let
denote a loss function of a parameter vector
of dimension associated with the data point . The corresponding population loss function is denoted as . The PS and local workers collaboratively learn a model defined by the parameter that minimizes the population loss:(1) 
The minimization of is typically carried out through stochastic gradient descent (SGD) algorithm. The model parameter at the iteration is updated as
(2) 
where is the learning rate and is the local gradient computed at the th local worker using its randomly selected the th data sample.
Assume that out of local workers are Byzantine attackers, and the remaining local workers are normal. To achieve (2), the PS needs to communicate with the local workers to collect and broadcast gradients through certain predefined protocol. However, the Byzantine attackers do not need to follow this protocol and can send arbitrary messages to the PS. Even worse, these attackers may have complete knowledge of the learning system and algorithms, and can collude with each other. Further, the communications between the PS and local workers inevitably introduce channel noises, while Byzantine attackers could also exploit this opportunity to disrupt FLOA. Next, we will show that different predefined analog aggregation transmission protocols result in different performance of FLOA in the presence of Byzantine attacks.
IiB Analog Aggregation Transmission Model
In FLOA, to exploit overtheair computation for lowlatency gradient aggregation, local gradients are amplitudemodulated for analog transmission and simultaneously transmitted from local workers to the PS through the same multiaccess channel. Assume symbollevel synchronization is achieved among the local workers through a synchronization channel [18]. To facilitate the powercontrol design, the transmitted symbols, denoted by
, are standardized such that they have zero mean and unit variance, i.e.,
. In this way, the powercontrol policy can be designed at the PS without knowledge of the specific transmitted symbols. Note that the standardization factors are uniform for all local gradients and therefore can be inverted at the PS.Since the statistics of the gradients may change over iterations, the standardization is executed in all communication rounds. Specifically, at the beginning of each communication round, each local worker estimates its mean and variance of the locally learnt gradient, denoted by
and , respectively. Then the locally estimated mean and variance are transmitted to the PS for global gradient statistics estimation by averaging. Given the received and , the PS averages all the local estimates to get the global estimates of the mean and variance of the gradient as and . Then the estimated and are broadcast back to the local workers and used for the standardization.After receiving the standardization factors and , each local worker performs the transmit signal standardization as follows:
(3) 
where is an allone vector with dimension equal to that of .
Considering only two symbols transmitted in each communication round, the individual locally estimated mean and variance are collected at the PS one by one. We assume such communications for standardization are noisefree without introducing errors. Note that the Byzantine attackers know the designed standardization method, and they would send the true mean and variance of their local gradients to avoid exposing themselves during the standardization stage. Otherwise, the attackers may be easily detected and then filtered out by the PS, as the normal workers and Byzantine workers have i.i.d. datasets.
After standardization, all local workers transmit their standardized local gradients to the PS with certain transmit power (the design of power control on will be discussed later in this section). The transmission of each local worker is subject to the transmit power constraint:
(4) 
Thus the power constraint boils down to .
On the other hand, the Byzantine attackers can send any signal
as its gradient update to the PS so as to skew FL. The transmit power
of the th Byzantine attackers satisfies(5) 
Considering block fading channels, where the wireless channels remain unchanged within each iteration in FL but may change independently from one iteration to another. We define the duration of one iteration as one time block, indexed by . At the th iteration, the received signals at the PS is given by
(6) 
where the first, second, and third terms correspond to normal workers, attackers and noise, respectively. In particular, is the channel gain from the th worker to the PS at the th iteration and is additive white Gaussian noise (AWGN) that is independent of the gradient updates. The channels follow independent Rayleigh fading, i.e., . In this work, we assume that the channels are perfectly known at local workers and the PS. With perfect channel state information (CSI), the channel phase offset is compensated at the local workers before they transmit their gradient updates.
After receiving the signals in (6) from the local workers, the PS performs destandardization to get the estimated aggregated gradient by inverting the standardization of (3) as follows:
(7) 
where the first term corresponds to the aggregated gradients from normal local workers, the second plus the third terms denote the attacking contributions of Byzantine attackers to the gradient update, and the final term is from the noise.
By using the estimated aggregated gradient, the global model parameters is updated at the th iteration by
(8) 
Next, we discuss two transmit power allocation schemes for the design of that are adopted by normal local workers: the existing channelinversion (CI) transmission [18, 21] and our proposed best effort voting (BEV) scheme.
IiB1 ChannelInversion Transmission Scheme
Given perfect known CSI, in the CI scheme[18, 21], channels are inverted by power control so that gradient parameters transmitted by different local workers are received with identical amplitudes, which leads to amplitude alignment at the PS. The transmit power of the th local worker is given by where is a scaling factor used to satisfy the power constraint in (4).
It is evident that
(9) 
where . Hence we can set for the power allocation. Since the channel coefficient is Rayleigh distributed ,
follows the exponential distribution with mean
. Thus, we have . As a result, for fulfilling the channelinversion scheme in practice, the transmit power of the th local worker is set as(10) 
where we set .
IiB2 The Proposed Best Effort Voting Scheme
To counter intelligent Byzantine attackers, our idea is to let normal local workers try their best to combat the impact of potential Byzantine attacks and to guide the FLOA to converge to the right direction, which is therefore named as the best effort voting (BEV) scheme. In the BEV scheme, normal local workers transmit their local gradients by using their maximum transmit power which is independent to their CSI knowledge. The transmit power of the th local worker in BEV scheme is given by
(11) 
In the next section, we will discuss and compare the robustness of these two different power allocation schemes in terms of the convergence behavior of FLOA in the presence of Byzantine attacks. Different power allocation schemes have different resilience against Byzantine attackers, we will discuss in the next section.
Iii The Convergence Analysis
In this section, we compare the convergence performance of the aforementioned two different power allocation schemes. We first prove that there exists a strongest attack where a Byzantine attacker tries its best to prevent the convergence of FLOA. And then under such a circumstance, we derive the convergence rates of FLOA when applying the two transmission schemes, respectively.
Iiia Assumptions
To facilitate the convergence analysis, we make several standard assumptions on the loss function and computed gradient estimates. Note that our theoretical derivations do not assume convex setting on the loss function. Therefore, our methodology is applicable to the popular deep neural networks (DNNs) applications.
Assumption 1: We assume the Lipschitz continuity and smoothness of the loss function , and thus we get[48]
(12) 
Assumption 2:
It is assumed that the stochastic local gradient estimates are independent and unbiased estimates of the global gradient with the bounded variance
[49, 21], i.e.,(13)  
(14) 
where we consider the standard SGD in this work. If the minibatched SGD with a size is applied, then the variance is bounded by .
Assumption 3: The standardization factors and are unbiased estimates of the global gradient with the bounded variance as follows[18]
(15)  
(16) 
The above assumptions allow tractable convergence analysis as follows.
IiiB The Strongest Attack Case of Byzantine Attacks
While the Byzantine attackers may send arbitrary signals to destroy FLOA, there exists the strongest attack that a Byzantine attacker can achieve to prevent the convergence of FLOA. Intuitively, the Byzantine attackers would like the global gradient make a Uturn to be updated at the PS as in the opposite direction of normal local workers. To this end, the Byzantine attackers will transmit to the PS with its maximum transmit power . In particular, given the global model parameter , the Byzantine attackers compute its own gradient by using their own local data. In addition, the transmit power satisfies the maximum power constraint, i.e., . This is the worst case that FLOA experiences in this work and we theoretically demonstrate in the following Theorem 1 that it is the strongest attack that a Byzantine attacker can achieve to prevent the convergence of FLOA.
Theorem 1.
Considering SGD for the FL system deploying analog aggregation transmission in the presence of Byzantine attackers, the strongest attacks can be performed as
(17)  
(18) 
Since the aforementioned strongest attack has been proved as the worst case that FLOA experiences, next we will evaluate the defensive efficiency of different transmission schemes via convergence analysis. We adopt the well known strategy of relating the norm of the gradient to the expected improvement to show the convergence for nonconvex optimization[50, 49, 21], i.e,
(19) 
where is the order of the cumulative number of the iterations . As we can see, if (19) holds, the norm of the gradient is expected to converge to 0 as increases to infinity, which means that FL converges after iterations. The convergence rate depends on the order value .
IiiC The Convergence of SGD with ChannelInversion Transmission
With CSI at each local worker, the CI power control can be performed as (10). The resultant convergence rate of the CI transmission scheme under the strongest attacks is derived as follows.
Theorem 2.
Considering SGD for the FL system deploying analog aggregation transmission with the CI power control and Byzantine attackers taking the strongest attacks as in (17)(18), the convergence rate is given by
(20) 
where
(21)  
(22) 
and we set the learning rate , where is a positive constant satisfying , and is the cumulative number of the communication rounds. The convergence is guaranteed if .
Remark 1.
Considering a small learning rate, the asymptotical convergence rate is dominated by . In addition, the convergence condition is given by , the proof of which is also provided in Appendix B. This condition implies that the FL converges as long as we set a small enough learning rate. From this convergence condition, we can see that even one Byzantine attacker (once it has a very large transmit power or its channel gain is very large) can destroy the FLOA.
Remark 2.
Considering a special case where all the local workers have the same maximum power (i.e., , ) and the independent and identically distributed channels (i.e., , ), we have the convergence condition . Therefore, we conclude that the number of attackers that the CI scheme can survive in this special case is limited to be no more than .
When there are no Byzantine attackers, i.e., , we have the following Lemma 1.
Lemma 1.
Considering SGD for the FL system deploying analog aggregation transmission with CI power control and no Byzantine attackers, the convergence rate is given by
(23) 
where .
Proof.
When , we have . Then setting , substituting , and into (20), we complete the proof. ∎
Remark 3.
As we can see, in the case of CI power control without Byzantine attackers, we get the fastest asymptotical convergence rate as , which is the same as the errorfree (EF) case where we do not consider the influence of wireless channels and noises.
IiiD The Convergence of SGD with BEV Transmission
For our BEV transmission scheme under the strongest attacks, the resultant convergence rate is derived as following Theorem 3.
Theorem 3.
Considering SGD for the FL system deploying analog aggregation transmission with BEV power control and Byzantine attackers taking the strongest attacks as in (17)(18), the convergence rate is given by
(24) 
where
(25)  
(26) 
and we set the learning rate , where is a positive constant satisfying , and is the cumulative number of the communication rounds. The convergence is always guaranteed since always holds.
Remark 4.
The proof of the convergence condition is also provided in Appendix C, which holds all the time. This implies that our BEV scheme can defend any Byzantine attacks, as long as we set a small enough learning rate. Note that the BEV scheme can always converge, but the convergence rate of BEV decreases due to the Byzantine attacks.
Remark 5.
Considering a small learning rate, if both the CI scheme and our BEV scheme can converge, the asymptotical convergence rate is dominated by . The comparison between and depends on the specific parameters. Considering a large learning rate, if both the CI scheme and our BEV scheme can converge, the asymptotical convergence rate is dominated by . Since , the convergence rate of BEV scheme is faster than that of the CI scheme.
Remark 6.
When there are no Byzantine attackers, i.e., , we still have . Considering a small learning rate, the asymptotical convergence rate of BEV is dominated by , which is slower than both the CI scheme and the EF case.
Iv Simulation Results
To evaluate the resilience of our proposed BEV scheme against Byzantine attacks, we provide the simulation results for an image classification task. Unless specified otherwise, the simulation settings are given as follows. We consider that the FLOA system has workers. The wireless channels between the workers and the PS are modeled as i.i.d. Rayleigh fading, by generating
’s from the complex Gaussian distribution
for different and . We set the average receive SNR at local workers, defined as dB [21].We consider the learning task of handwrittendigit identification using the wellknown MNIST dataset^{1}^{1}1http://yann.lecun.com/exdb/mnist/
that consists of 10 classes ranging from digit “0” to “9”. In the MNIST dataset, a total of 60000 labeled training data samples and 10000 test samples are available for training and testing a learning model. In our experiments, we train a multilayer perceptron (MLP) with a 784neuron input layer, a 64neuron hidden layer, and a 10neuron softmax output layer. We adopt rectified linear unit (ReLU) as the activation function, and cross entropy as the loss function. The total number of parameters in the MLP is
. We randomly select distinct training samples and distribute them to all local workers as their different local datasets, i.e., , for any .We evaluate our BEV scheme under different attacks, including 1) without any attacks, 2) only one attacker who is far to the PS, 3) only one attacker who is close to the PS, and 4) randomly selected several attackers. We compare two benchmarks with our BEV scheme, including 1) the CI scheme and 2) the FLOA under the ideal EF case where we do not consider the influence of wireless channels and noises.
Iva Performance without Attacks
The EF case is set as the benchmark where the local gradients are perfectly aggregated at the PS, i.e., we set the channel and the AWGN . In Fig 1, we compare the performance of BEV with CI and EF without Byzantine attacks. We set adjusting factor of the learning rate, define . As we can see, the performance of CI is almost the same as EF. However, the performance of BEV is 2% loss compared to CI and EF. This results are in agreement with our theoretical analysis in Theorem 3, which has been discussed in Remark 6. That is, CI converges a little faster than our BEV scheme, if and only if there is no Byzantine attackers, which however cannot be guaranteed in practice of adversarial environments.
IvB Performance under single Attacker With the Lowest Channel Gain
In Fig 2, we compare the performance of BEV with CI under single Byzantine attack. We consider the local worker whose channel gain is the lowest as the Byzantine attacker. The Byzantine attacker still adopt the strongest attack strategy to destroy FLOA. Under different adjusting factors of the learning rate , we compare the performance of BEV with CI. Since the Byzantine attacker’s channel gain is the lowest among all local workers, the overall impact of its attack to FLOA is relatively weak. In this case, both BEV and CI can converge, if a proper learning rate is selected. On the other hand, when the learning rate is not properly chosen e.g. when in Fig. 2, BEV can converge but CI fails. When , both BEV and CI can converge, but the convergence rate of BEV is faster than that of CI. This is because considering a large learning rate, the asymptotical convergence rate is dominated by and . When , the performance of BEV is a little bit weaker in performance than CI. In practice, in the case of guaranteed convergence, we prefer a large learning rate to achieve a fast convergence rate. Under a large learning rate, e.g., , our BEV works better than CI.
IvC Performance under single Attacker with the Highest Channel Gain
In Fig 3, we compare the performance of BEV with CI under a Byzantine attacker whose channel gain is the highest. Due to the highest channel gain of the Byzantine attacker, we consider its attack as a strong attack. In this case of strong attacks, we compare the performance of BEV with CI under different adjusting factors of the learning rate . Since the convergence condition is hard to guarantee, it can be seen from Fig 3 that CI cannot converge or coverage to a failure situation. As the decrease of , it is useful for CI to converge to the right direction, but it still cannot defend the attack after a few iterations. On the other hand, BEV can still converge even in this strong attack case. Thus, if there is a strong attack, BEV is a better choice than CI. In addition, the convergence rate decreases as decreases. This implies that a larger learning rate is preferred under the condition of guaranteed convergence.
IvD Performance with Multiple Randomly Selected Attackers
In Fig 4, we compare the performance of BEV with CI under the different number of Byzantine attackers. As we can see, when the number of Byzantine attackers is less than 4, both BEV and CI can converge, but the convergence rate decreases as the number of Byzantine attackers increases. When the number of Byzantine attackers is 4, i.e., , CI can not converge to the correct direction, while BEV still converges in the correct direction but it converges at a slower rate. These results are consistent with our discussions in Remark 2 and Remark 4.
V Conclusion
This paper studies the robustness of FL over the air (FLOA) against Byzantine attacks. We provide theoretical analysis on convergence performance of different transmission schemes. Our analytical results reveal the strongest attacking strategy that Byzantine attackers can achieve to prevent FLOA from converging to the correct direction. Through our convergence analysis, we find that: in the absence of any Byzantine attacker, CI has the performance comparable to the ideal EF, while BEV has 2% performance loss. In the weakest Byzantine attack, considering a large learning rate, both CI and BEV can converge while BEV converges faster than CI. If there exists a strong Byzantine attacker, the convergence of CI cannot be guaranteed, but BEV can still converge. In practice, since it is impossible to determine the intensity of potential attacks, BEV is the better option to against Byzantine attacks because it performs well under various attack situations.
Acknowledgments
This work was partly supported by the National Natural Science Foundation of China (Grant Nos. 61871023 and 61931001), Beijing Natural Science Foundation (Grant No. 4202054), and the National Science Foundation of the US (Grant Nos. 1741338 and 1939553).
Appendix A Proof of Theorem 1
Given the estimates of global gradient in (IIB), we have the update rule for model parameters as follows
(27) 
Rewriting this inequality and taking the expectation, we have
(29) 
Using Jensen’s inequality , it has
(30) 
Substituting (30) to (A), we get
(31) 
As we can see, if , the objective decreases monotonically, i.e., FL would converge. We can set a small enough learning rate to ensure, as long as we have
(32) 
In order to prevent convergence, the Byzantine attackers are supposed to spend their efforts to make . Thus, the best way for them is to send with their maximum power so as to make .
Given the power constraint in (5), we have
(33) 
As a result, the Byzantine attackers are supposed to send with their maximum power .
Appendix B Proof of Theorem 2
Given the estimates of global gradient in (IIB), the power allocation policy in (10), and the strongest attacks in Theorem 1, we have the update rule for model parameters as follows
(34) 
Rewriting this inequality and taking the expectation, we get
(36) 
where , because of the Rayleigh distributed .
Using Jensen’s inequality, we have
(37) 
Comments
There are no comments yet.