We present a boundary integral equation solver for wave scattering suited for manycore processors, which are expected to be the building blocks of energy-austere exascale systems, and on which algorithmic and architecture-oriented optimizations are essential for achieving worthy performance. We use the GMRES iterative method and the fast multipole method (FMM) to implement the matrix-vector product (MatVec) kernel. We implement highly optimized kernels for both shared- and distributed-memory architectures and provide various performance models for tuning the task-based tree traversal implementation of FMM. We develop optimal architecture-specific and algorithm-aware partitioning, load balancing, and communication reducing mechanisms to scale up to 6,144 compute nodes of a Cray XC40 with 196,608 hardware cores. Task-based traversal achieves roughly 77\% and 60\% of the peak single precision floating point performance of a 56-core Skylake Scalable processor and a 72-core Knights Landing processor, respectively. These represent approximately sevenfold speedups compared to the baseline scalar code. We report weak scalability in accordance with the best possible O(log P) communication complexity and the theoretical scaling complexity of FMM. These result in up to 85\% efficiency in strong scaling while computing in excess of 2 billion degrees of freedom (DoFs) on the full-scale Cray XC40. To demonstrate the applicability of the proposed implementation of FMM, we use it in the solution of the acoustic scattering problem involving soft targets. This calls for the discretization of an integral equation with an oscillatory kernel and (weak) 1/R singularity. Applications of FMM to the problems involving hard targets, which call for implementation of more complicated singularity extraction/cancellation schemes to account for 1/R2 singularity, or to Burton-Miller surface integral formulation, which results in a well-conditioned matrix system, are not considered in this work, but the efficiency and scalability results demonstrated here are expected to be observed for these cases as well. The numerical results are converged to within 1.0e-4 relative 2-norm residual accuracy of the analytical solution for the sphere, and a multiple flower-shaped scatterer displays comparable algebraic convergence in a more general setting.