FreeBSD/Linux Kernel Cross Reference
sys/contrib/openzfs/module/icp/asm-x86_64/modes/ghash-x86_64.S

    1 # Copyright 2010-2016 The OpenSSL Project Authors. All Rights Reserved.
    2 #
    3 # Licensed under the Apache License 2.0 (the "License").  You may not use
    4 # this file except in compliance with the License.  You can obtain a copy
    5 # in the file LICENSE in the source distribution or at
    6 # https://www.openssl.org/source/license.html
    7 
    8 #
    9 # ====================================================================
   10 # Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
   11 # project. The module is, however, dual licensed under OpenSSL and
   12 # CRYPTOGAMS licenses depending on where you obtain it. For further
   13 # details see http://www.openssl.org/~appro/cryptogams/.
   14 # ====================================================================
   15 #
   16 # March, June 2010
   17 #
   18 # The module implements the "4-bit" GCM GHASH function and the underlying
   19 # single multiplication operation in GF(2^128). "4-bit" means that it
   20 # uses a 256-byte per-key table [+128 bytes shared table]. The GHASH
   21 # function features a so-called "528B" variant utilizing an additional
   22 # 256+16 bytes of per-key storage [+512 bytes shared table]. Performance
   23 # results are for this streamed GHASH subroutine and are expressed in
   24 # cycles per processed byte; lower is better:
   25 #
   26 #               gcc 3.4.x(*)    assembler
   27 #
   28 # P4            28.6            14.0            +100%
   29 # Opteron       19.3            7.7             +150%
   30 # Core2         17.8            8.1(**)         +120%
   31 # Atom          31.6            16.8            +88%
   32 # VIA Nano      21.8            10.1            +115%
   33 #
   34 # (*)   comparison is not completely fair, because C results are
   35 #       for vanilla "256B" implementation, while assembler results
   36 #       are for "528B";-)
   37 # (**)  it's a mystery [to me] why the Core2 result is not the same as
   38 #       Opteron's;
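      #
      # For reference, a minimal C sketch of the single GF(2^128) multiply that
      # the 4-bit table method accelerates (bit-serial, following NIST SP
      # 800-38D; illustrative only, not part of the generated code):
      #
      #   #include <stdint.h>
      #
      #   typedef struct { uint64_t hi, lo; } ghash_u128;
      #
      #   static ghash_u128 gf128_mul(ghash_u128 a, ghash_u128 b)
      #   {
      #           ghash_u128 z = { 0, 0 }, v = b;
      #
      #           for (int i = 0; i < 128; i++) {
      #                   /* bit 0 of a is its most significant bit in GCM */
      #                   uint64_t bit = (i < 64) ? (a.hi >> (63 - i)) & 1
      #                                           : (a.lo >> (127 - i)) & 1;
      #                   if (bit) {
      #                           z.hi ^= v.hi;
      #                           z.lo ^= v.lo;
      #                   }
      #                   /* v = v * x; reduce by x^128+x^7+x^2+x+1 on carry-out */
      #                   uint64_t carry = v.lo & 1;
      #                   v.lo = (v.lo >> 1) | (v.hi << 63);
      #                   v.hi >>= 1;
      #                   if (carry)
      #                           v.hi ^= 0xE100000000000000ULL;
      #           }
      #           return z;
      #   }
      #
      # The "4-bit" code instead consumes four bits of a per step against a
      # 256-byte per-key table of the 16 multiples 0*H .. 15*H (16 entries of
      # 16 bytes each).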
   39 
   40 # May 2010
   41 #
   42 # Add a PCLMULQDQ version performing at 2.02 cycles per processed byte.
   43 # See ghash-x86.pl for background information and details about coding
   44 # techniques.
   45 #
   46 # Special thanks to David Woodhouse for providing access to a
   47 # Westmere-based system on behalf of Intel Open Source Technology Centre.
   48 
   49 # December 2012
   50 #
   51 # Overhaul: aggregate Karatsuba post-processing, improve ILP in
   52 # reduction_alg9, increase reduction aggregate factor to 4x. As for
   53 # the latter: ghash-x86.pl discusses why it makes less sense to
   54 # increase the aggregate factor there. Then why increase it here? The
   55 # critical path consists of 3 independent pclmulqdq instructions,
   56 # Karatsuba post-processing and reduction. "On top" of this we lay down
   57 # aggregated multiplication operations, triplets of independent
   58 # pclmulqdq's. As the issue rate for pclmulqdq is limited, it makes
   59 # little sense to aggregate more multiplications than it takes to
   60 # perform the remaining non-multiplication operations. 2x is a
   61 # near-optimal coefficient for contemporary Intel CPUs (hence the modest
   62 # improvement), but not for Bulldozer, whose logical SIMD operations
   63 # are twice as slow as Intel's, making the critical path longer. A CPU
   64 # with a higher pclmulqdq issue rate would also benefit from a higher
   65 # aggregate factor...
   66 #
   67 # Westmere      1.78(+13%)
   68 # Sandy Bridge  1.80(+8%)
   69 # Ivy Bridge    1.80(+7%)
   70 # Haswell       0.55(+93%) (if system doesn't support AVX)
   71 # Broadwell     0.45(+110%)(if system doesn't support AVX)
   72 # Skylake       0.44(+110%)(if system doesn't support AVX)
   73 # Bulldozer     1.49(+27%)
   74 # Silvermont    2.88(+13%)
   75 # Knights L     2.12(-)    (if system doesn't support AVX)
   76 # Goldmont      1.08(+24%)
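      #
      # Each aggregated multiplication can be pictured in C intrinsics as one
      # Karatsuba-style 128x128-bit carry-less multiply: three pclmulqdq's plus
      # XOR/shift post-processing (a sketch only, assuming <wmmintrin.h>; the
      # helper name is illustrative, the module itself is hand-written assembly):
      #
      #   #include <emmintrin.h>
      #   #include <wmmintrin.h>        /* _mm_clmulepi64_si128 */
      #
      #   /* 128x128 -> 256-bit carry-less multiply with three PCLMULQDQs. */
      #   static void clmul_karatsuba(__m128i a, __m128i b,
      #                               __m128i *lo, __m128i *hi)
      #   {
      #           __m128i lo64 = _mm_clmulepi64_si128(a, b, 0x00); /* a0*b0 */
      #           __m128i hi64 = _mm_clmulepi64_si128(a, b, 0x11); /* a1*b1 */
      #           /* fold the halves (cf. pshufd $78 / pxor below) */
      #           __m128i af = _mm_xor_si128(a, _mm_shuffle_epi32(a, 0x4e));
      #           __m128i bf = _mm_xor_si128(b, _mm_shuffle_epi32(b, 0x4e));
      #           __m128i mid = _mm_clmulepi64_si128(af, bf, 0x00);
      #
      #           /* Karatsuba fix-up: mid = a0*b1 ^ a1*b0 */
      #           mid = _mm_xor_si128(mid, _mm_xor_si128(lo64, hi64));
      #           *lo = _mm_xor_si128(lo64, _mm_slli_si128(mid, 8));
      #           *hi = _mm_xor_si128(hi64, _mm_srli_si128(mid, 8));
      #   }
      #
      # The reduction back to 128 bits is then amortized over several such
      # multiplications; the aggregate factor is how many of them are batched
      # per reduction.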
   77 
   78 # March 2013
   79 #
   80 # ... the 8x aggregate factor AVX code path uses the reduction algorithm
   81 # suggested by Shay Gueron[1]. Even though contemporary AVX-capable
   82 # CPUs such as Sandy and Ivy Bridge can execute it, the code performs
   83 # sub-optimally compared to the above-mentioned version. But thanks to
   84 # Ilya Albrekht and Max Locktyukhin of Intel Corp. we know that it
   85 # performs at 0.41 cycles per byte on a Haswell processor, at 0.29 on
   86 # Broadwell, and at 0.36 on Skylake.
   87 #
   88 # Knights Landing achieves 1.09 cpb.
   89 #
   90 # [1] http://rt.openssl.org/Ticket/Display.html?id=2900&user=guest&pass=guest
   91 
   92 # Generated once from
   93 # https://github.com/openssl/openssl/blob/5ffc3324/crypto/modes/asm/ghash-x86_64.pl
   94 # and modified for ICP. Modifications are kept to a bare minimum to ease
   95 # later upstream merges.
   96 
   97 #if defined(__x86_64__) && defined(HAVE_AVX) && \
   98     defined(HAVE_AES) && defined(HAVE_PCLMULQDQ)
   99 
  100 #define _ASM
  101 #include <sys/asm_linkage.h>
  102 
  103 .text
  104 
  105 /* Windows userland links with OpenSSL */
  106 #if !defined (_WIN32) || defined (_KERNEL)
  107 ENTRY_ALIGN(gcm_gmult_clmul, 16)
  108 
  109 .cfi_startproc
  110         ENDBR
  111 
  112 .L_gmult_clmul:
  113         movdqu  (%rdi),%xmm0
  114         movdqa  .Lbswap_mask(%rip),%xmm5
  115         movdqu  (%rsi),%xmm2
  116         movdqu  32(%rsi),%xmm4
  117 .byte   102,15,56,0,197
  118         movdqa  %xmm0,%xmm1
  119         pshufd  $78,%xmm0,%xmm3
  120         pxor    %xmm0,%xmm3
  121 .byte   102,15,58,68,194,0
  122 .byte   102,15,58,68,202,17
  123 .byte   102,15,58,68,220,0
  124         pxor    %xmm0,%xmm3
  125         pxor    %xmm1,%xmm3
  126 
  127         movdqa  %xmm3,%xmm4
  128         psrldq  $8,%xmm3
  129         pslldq  $8,%xmm4
  130         pxor    %xmm3,%xmm1
  131         pxor    %xmm4,%xmm0
  132 
  133         movdqa  %xmm0,%xmm4
  134         movdqa  %xmm0,%xmm3
  135         psllq   $5,%xmm0
  136         pxor    %xmm0,%xmm3
  137         psllq   $1,%xmm0
  138         pxor    %xmm3,%xmm0
  139         psllq   $57,%xmm0
  140         movdqa  %xmm0,%xmm3
  141         pslldq  $8,%xmm0
  142         psrldq  $8,%xmm3
  143         pxor    %xmm4,%xmm0
  144         pxor    %xmm3,%xmm1
  145 
  146 
  147         movdqa  %xmm0,%xmm4
  148         psrlq   $1,%xmm0
  149         pxor    %xmm4,%xmm1
  150         pxor    %xmm0,%xmm4
  151         psrlq   $5,%xmm0
  152         pxor    %xmm4,%xmm0
  153         psrlq   $1,%xmm0
  154         pxor    %xmm1,%xmm0
  155 .byte   102,15,56,0,197
  156         movdqu  %xmm0,(%rdi)
  157         RET
  158 .cfi_endproc
  159 SET_SIZE(gcm_gmult_clmul)
  160 #endif /* !_WIN32 || _KERNEL */
  161 
  162 ENTRY_ALIGN(gcm_init_htab_avx, 32)
  163 .cfi_startproc
  164         ENDBR
  165         vzeroupper
  166 
  167         vmovdqu (%rsi),%xmm2
  168         // KCF/ICP stores H in network byte order with the hi qword first
  169         // so we need to swap all bytes, not the 2 qwords.
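              // In C terms, the vpshufb with .Lbswap_mask below amounts to a
              // full 16-byte reversal, not an exchange of the two 8-byte
              // halves (hypothetical helper, for illustration only):
              //
              //   static inline void bswap128(const uint8_t in[16], uint8_t out[16])
              //   {
              //           for (int i = 0; i < 16; i++)
              //                   out[i] = in[15 - i];    /* full byte reversal */
              //   }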
  170         vmovdqu .Lbswap_mask(%rip),%xmm4
  171         vpshufb %xmm4,%xmm2,%xmm2
  172 
  173 
  174         vpshufd $255,%xmm2,%xmm4
  175         vpsrlq  $63,%xmm2,%xmm3
  176         vpsllq  $1,%xmm2,%xmm2
  177         vpxor   %xmm5,%xmm5,%xmm5
  178         vpcmpgtd        %xmm4,%xmm5,%xmm5
  179         vpslldq $8,%xmm3,%xmm3
  180         vpor    %xmm3,%xmm2,%xmm2
  181 
  182 
  183         vpand   .L0x1c2_polynomial(%rip),%xmm5,%xmm5
  184         vpxor   %xmm5,%xmm2,%xmm2
  185 
  186         vpunpckhqdq     %xmm2,%xmm2,%xmm6
  187         vmovdqa %xmm2,%xmm0
  188         vpxor   %xmm2,%xmm6,%xmm6
  189         movq    $4,%r10
  190         jmp     .Linit_start_avx
  191 .balign 32
  192 .Linit_loop_avx:
  193         vpalignr        $8,%xmm3,%xmm4,%xmm5
  194         vmovdqu %xmm5,-16(%rdi)
  195         vpunpckhqdq     %xmm0,%xmm0,%xmm3
  196         vpxor   %xmm0,%xmm3,%xmm3
  197         vpclmulqdq      $0x11,%xmm2,%xmm0,%xmm1
  198         vpclmulqdq      $0x00,%xmm2,%xmm0,%xmm0
  199         vpclmulqdq      $0x00,%xmm6,%xmm3,%xmm3
  200         vpxor   %xmm0,%xmm1,%xmm4
  201         vpxor   %xmm4,%xmm3,%xmm3
  202 
  203         vpslldq $8,%xmm3,%xmm4
  204         vpsrldq $8,%xmm3,%xmm3
  205         vpxor   %xmm4,%xmm0,%xmm0
  206         vpxor   %xmm3,%xmm1,%xmm1
  207         vpsllq  $57,%xmm0,%xmm3
  208         vpsllq  $62,%xmm0,%xmm4
  209         vpxor   %xmm3,%xmm4,%xmm4
  210         vpsllq  $63,%xmm0,%xmm3
  211         vpxor   %xmm3,%xmm4,%xmm4
  212         vpslldq $8,%xmm4,%xmm3
  213         vpsrldq $8,%xmm4,%xmm4
  214         vpxor   %xmm3,%xmm0,%xmm0
  215         vpxor   %xmm4,%xmm1,%xmm1
  216 
  217         vpsrlq  $1,%xmm0,%xmm4
  218         vpxor   %xmm0,%xmm1,%xmm1
  219         vpxor   %xmm4,%xmm0,%xmm0
  220         vpsrlq  $5,%xmm4,%xmm4
  221         vpxor   %xmm4,%xmm0,%xmm0
  222         vpsrlq  $1,%xmm0,%xmm0
  223         vpxor   %xmm1,%xmm0,%xmm0
  224 .Linit_start_avx:
  225         vmovdqa %xmm0,%xmm5
  226         vpunpckhqdq     %xmm0,%xmm0,%xmm3
  227         vpxor   %xmm0,%xmm3,%xmm3
  228         vpclmulqdq      $0x11,%xmm2,%xmm0,%xmm1
  229         vpclmulqdq      $0x00,%xmm2,%xmm0,%xmm0
  230         vpclmulqdq      $0x00,%xmm6,%xmm3,%xmm3
  231         vpxor   %xmm0,%xmm1,%xmm4
  232         vpxor   %xmm4,%xmm3,%xmm3
  233 
  234         vpslldq $8,%xmm3,%xmm4
  235         vpsrldq $8,%xmm3,%xmm3
  236         vpxor   %xmm4,%xmm0,%xmm0
  237         vpxor   %xmm3,%xmm1,%xmm1
  238         vpsllq  $57,%xmm0,%xmm3
  239         vpsllq  $62,%xmm0,%xmm4
  240         vpxor   %xmm3,%xmm4,%xmm4
  241         vpsllq  $63,%xmm0,%xmm3
  242         vpxor   %xmm3,%xmm4,%xmm4
  243         vpslldq $8,%xmm4,%xmm3
  244         vpsrldq $8,%xmm4,%xmm4
  245         vpxor   %xmm3,%xmm0,%xmm0
  246         vpxor   %xmm4,%xmm1,%xmm1
  247 
  248         vpsrlq  $1,%xmm0,%xmm4
  249         vpxor   %xmm0,%xmm1,%xmm1
  250         vpxor   %xmm4,%xmm0,%xmm0
  251         vpsrlq  $5,%xmm4,%xmm4
  252         vpxor   %xmm4,%xmm0,%xmm0
  253         vpsrlq  $1,%xmm0,%xmm0
  254         vpxor   %xmm1,%xmm0,%xmm0
  255         vpshufd $78,%xmm5,%xmm3
  256         vpshufd $78,%xmm0,%xmm4
  257         vpxor   %xmm5,%xmm3,%xmm3
  258         vmovdqu %xmm5,0(%rdi)
  259         vpxor   %xmm0,%xmm4,%xmm4
  260         vmovdqu %xmm0,16(%rdi)
  261         leaq    48(%rdi),%rdi
  262         subq    $1,%r10
  263         jnz     .Linit_loop_avx
  264 
  265         vpalignr        $8,%xmm4,%xmm3,%xmm5
  266         vmovdqu %xmm5,-16(%rdi)
  267 
  268         vzeroupper
  269         RET
  270 .cfi_endproc
  271 SET_SIZE(gcm_init_htab_avx)
  272 
  273 #if !defined (_WIN32) || defined (_KERNEL)
  274 ENTRY_ALIGN(gcm_gmult_avx, 32)
  275 .cfi_startproc
  276         ENDBR
  277         jmp     .L_gmult_clmul
  278 .cfi_endproc
  279 SET_SIZE(gcm_gmult_avx)
  280 
  281 ENTRY_ALIGN(gcm_ghash_avx, 32)
  282 .cfi_startproc
  283         ENDBR
  284         vzeroupper
  285 
  286         vmovdqu (%rdi),%xmm10
  287         leaq    .L0x1c2_polynomial(%rip),%r10
  288         leaq    64(%rsi),%rsi
  289         vmovdqu .Lbswap_mask(%rip),%xmm13
  290         vpshufb %xmm13,%xmm10,%xmm10
  291         cmpq    $0x80,%rcx
  292         jb      .Lshort_avx
  293         subq    $0x80,%rcx
  294 
  295         vmovdqu 112(%rdx),%xmm14
  296         vmovdqu 0-64(%rsi),%xmm6
  297         vpshufb %xmm13,%xmm14,%xmm14
  298         vmovdqu 32-64(%rsi),%xmm7
  299 
  300         vpunpckhqdq     %xmm14,%xmm14,%xmm9
  301         vmovdqu 96(%rdx),%xmm15
  302         vpclmulqdq      $0x00,%xmm6,%xmm14,%xmm0
  303         vpxor   %xmm14,%xmm9,%xmm9
  304         vpshufb %xmm13,%xmm15,%xmm15
  305         vpclmulqdq      $0x11,%xmm6,%xmm14,%xmm1
  306         vmovdqu 16-64(%rsi),%xmm6
  307         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  308         vmovdqu 80(%rdx),%xmm14
  309         vpclmulqdq      $0x00,%xmm7,%xmm9,%xmm2
  310         vpxor   %xmm15,%xmm8,%xmm8
  311 
  312         vpshufb %xmm13,%xmm14,%xmm14
  313         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm3
  314         vpunpckhqdq     %xmm14,%xmm14,%xmm9
  315         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm4
  316         vmovdqu 48-64(%rsi),%xmm6
  317         vpxor   %xmm14,%xmm9,%xmm9
  318         vmovdqu 64(%rdx),%xmm15
  319         vpclmulqdq      $0x10,%xmm7,%xmm8,%xmm5
  320         vmovdqu 80-64(%rsi),%xmm7
  321 
  322         vpshufb %xmm13,%xmm15,%xmm15
  323         vpxor   %xmm0,%xmm3,%xmm3
  324         vpclmulqdq      $0x00,%xmm6,%xmm14,%xmm0
  325         vpxor   %xmm1,%xmm4,%xmm4
  326         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  327         vpclmulqdq      $0x11,%xmm6,%xmm14,%xmm1
  328         vmovdqu 64-64(%rsi),%xmm6
  329         vpxor   %xmm2,%xmm5,%xmm5
  330         vpclmulqdq      $0x00,%xmm7,%xmm9,%xmm2
  331         vpxor   %xmm15,%xmm8,%xmm8
  332 
  333         vmovdqu 48(%rdx),%xmm14
  334         vpxor   %xmm3,%xmm0,%xmm0
  335         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm3
  336         vpxor   %xmm4,%xmm1,%xmm1
  337         vpshufb %xmm13,%xmm14,%xmm14
  338         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm4
  339         vmovdqu 96-64(%rsi),%xmm6
  340         vpxor   %xmm5,%xmm2,%xmm2
  341         vpunpckhqdq     %xmm14,%xmm14,%xmm9
  342         vpclmulqdq      $0x10,%xmm7,%xmm8,%xmm5
  343         vmovdqu 128-64(%rsi),%xmm7
  344         vpxor   %xmm14,%xmm9,%xmm9
  345 
  346         vmovdqu 32(%rdx),%xmm15
  347         vpxor   %xmm0,%xmm3,%xmm3
  348         vpclmulqdq      $0x00,%xmm6,%xmm14,%xmm0
  349         vpxor   %xmm1,%xmm4,%xmm4
  350         vpshufb %xmm13,%xmm15,%xmm15
  351         vpclmulqdq      $0x11,%xmm6,%xmm14,%xmm1
  352         vmovdqu 112-64(%rsi),%xmm6
  353         vpxor   %xmm2,%xmm5,%xmm5
  354         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  355         vpclmulqdq      $0x00,%xmm7,%xmm9,%xmm2
  356         vpxor   %xmm15,%xmm8,%xmm8
  357 
  358         vmovdqu 16(%rdx),%xmm14
  359         vpxor   %xmm3,%xmm0,%xmm0
  360         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm3
  361         vpxor   %xmm4,%xmm1,%xmm1
  362         vpshufb %xmm13,%xmm14,%xmm14
  363         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm4
  364         vmovdqu 144-64(%rsi),%xmm6
  365         vpxor   %xmm5,%xmm2,%xmm2
  366         vpunpckhqdq     %xmm14,%xmm14,%xmm9
  367         vpclmulqdq      $0x10,%xmm7,%xmm8,%xmm5
  368         vmovdqu 176-64(%rsi),%xmm7
  369         vpxor   %xmm14,%xmm9,%xmm9
  370 
  371         vmovdqu (%rdx),%xmm15
  372         vpxor   %xmm0,%xmm3,%xmm3
  373         vpclmulqdq      $0x00,%xmm6,%xmm14,%xmm0
  374         vpxor   %xmm1,%xmm4,%xmm4
  375         vpshufb %xmm13,%xmm15,%xmm15
  376         vpclmulqdq      $0x11,%xmm6,%xmm14,%xmm1
  377         vmovdqu 160-64(%rsi),%xmm6
  378         vpxor   %xmm2,%xmm5,%xmm5
  379         vpclmulqdq      $0x10,%xmm7,%xmm9,%xmm2
  380 
  381         leaq    128(%rdx),%rdx
  382         cmpq    $0x80,%rcx
  383         jb      .Ltail_avx
  384 
  385         vpxor   %xmm10,%xmm15,%xmm15
  386         subq    $0x80,%rcx
  387         jmp     .Loop8x_avx
  388 
  389 .balign 32
  390 .Loop8x_avx:
  391         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  392         vmovdqu 112(%rdx),%xmm14
  393         vpxor   %xmm0,%xmm3,%xmm3
  394         vpxor   %xmm15,%xmm8,%xmm8
  395         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm10
  396         vpshufb %xmm13,%xmm14,%xmm14
  397         vpxor   %xmm1,%xmm4,%xmm4
  398         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm11
  399         vmovdqu 0-64(%rsi),%xmm6
  400         vpunpckhqdq     %xmm14,%xmm14,%xmm9
  401         vpxor   %xmm2,%xmm5,%xmm5
  402         vpclmulqdq      $0x00,%xmm7,%xmm8,%xmm12
  403         vmovdqu 32-64(%rsi),%xmm7
  404         vpxor   %xmm14,%xmm9,%xmm9
  405 
  406         vmovdqu 96(%rdx),%xmm15
  407         vpclmulqdq      $0x00,%xmm6,%xmm14,%xmm0
  408         vpxor   %xmm3,%xmm10,%xmm10
  409         vpshufb %xmm13,%xmm15,%xmm15
  410         vpclmulqdq      $0x11,%xmm6,%xmm14,%xmm1
  411         vxorps  %xmm4,%xmm11,%xmm11
  412         vmovdqu 16-64(%rsi),%xmm6
  413         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  414         vpclmulqdq      $0x00,%xmm7,%xmm9,%xmm2
  415         vpxor   %xmm5,%xmm12,%xmm12
  416         vxorps  %xmm15,%xmm8,%xmm8
  417 
  418         vmovdqu 80(%rdx),%xmm14
  419         vpxor   %xmm10,%xmm12,%xmm12
  420         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm3
  421         vpxor   %xmm11,%xmm12,%xmm12
  422         vpslldq $8,%xmm12,%xmm9
  423         vpxor   %xmm0,%xmm3,%xmm3
  424         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm4
  425         vpsrldq $8,%xmm12,%xmm12
  426         vpxor   %xmm9,%xmm10,%xmm10
  427         vmovdqu 48-64(%rsi),%xmm6
  428         vpshufb %xmm13,%xmm14,%xmm14
  429         vxorps  %xmm12,%xmm11,%xmm11
  430         vpxor   %xmm1,%xmm4,%xmm4
  431         vpunpckhqdq     %xmm14,%xmm14,%xmm9
  432         vpclmulqdq      $0x10,%xmm7,%xmm8,%xmm5
  433         vmovdqu 80-64(%rsi),%xmm7
  434         vpxor   %xmm14,%xmm9,%xmm9
  435         vpxor   %xmm2,%xmm5,%xmm5
  436 
  437         vmovdqu 64(%rdx),%xmm15
  438         vpalignr        $8,%xmm10,%xmm10,%xmm12
  439         vpclmulqdq      $0x00,%xmm6,%xmm14,%xmm0
  440         vpshufb %xmm13,%xmm15,%xmm15
  441         vpxor   %xmm3,%xmm0,%xmm0
  442         vpclmulqdq      $0x11,%xmm6,%xmm14,%xmm1
  443         vmovdqu 64-64(%rsi),%xmm6
  444         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  445         vpxor   %xmm4,%xmm1,%xmm1
  446         vpclmulqdq      $0x00,%xmm7,%xmm9,%xmm2
  447         vxorps  %xmm15,%xmm8,%xmm8
  448         vpxor   %xmm5,%xmm2,%xmm2
  449 
  450         vmovdqu 48(%rdx),%xmm14
  451         vpclmulqdq      $0x10,(%r10),%xmm10,%xmm10
  452         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm3
  453         vpshufb %xmm13,%xmm14,%xmm14
  454         vpxor   %xmm0,%xmm3,%xmm3
  455         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm4
  456         vmovdqu 96-64(%rsi),%xmm6
  457         vpunpckhqdq     %xmm14,%xmm14,%xmm9
  458         vpxor   %xmm1,%xmm4,%xmm4
  459         vpclmulqdq      $0x10,%xmm7,%xmm8,%xmm5
  460         vmovdqu 128-64(%rsi),%xmm7
  461         vpxor   %xmm14,%xmm9,%xmm9
  462         vpxor   %xmm2,%xmm5,%xmm5
  463 
  464         vmovdqu 32(%rdx),%xmm15
  465         vpclmulqdq      $0x00,%xmm6,%xmm14,%xmm0
  466         vpshufb %xmm13,%xmm15,%xmm15
  467         vpxor   %xmm3,%xmm0,%xmm0
  468         vpclmulqdq      $0x11,%xmm6,%xmm14,%xmm1
  469         vmovdqu 112-64(%rsi),%xmm6
  470         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  471         vpxor   %xmm4,%xmm1,%xmm1
  472         vpclmulqdq      $0x00,%xmm7,%xmm9,%xmm2
  473         vpxor   %xmm15,%xmm8,%xmm8
  474         vpxor   %xmm5,%xmm2,%xmm2
  475         vxorps  %xmm12,%xmm10,%xmm10
  476 
  477         vmovdqu 16(%rdx),%xmm14
  478         vpalignr        $8,%xmm10,%xmm10,%xmm12
  479         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm3
  480         vpshufb %xmm13,%xmm14,%xmm14
  481         vpxor   %xmm0,%xmm3,%xmm3
  482         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm4
  483         vmovdqu 144-64(%rsi),%xmm6
  484         vpclmulqdq      $0x10,(%r10),%xmm10,%xmm10
  485         vxorps  %xmm11,%xmm12,%xmm12
  486         vpunpckhqdq     %xmm14,%xmm14,%xmm9
  487         vpxor   %xmm1,%xmm4,%xmm4
  488         vpclmulqdq      $0x10,%xmm7,%xmm8,%xmm5
  489         vmovdqu 176-64(%rsi),%xmm7
  490         vpxor   %xmm14,%xmm9,%xmm9
  491         vpxor   %xmm2,%xmm5,%xmm5
  492 
  493         vmovdqu (%rdx),%xmm15
  494         vpclmulqdq      $0x00,%xmm6,%xmm14,%xmm0
  495         vpshufb %xmm13,%xmm15,%xmm15
  496         vpclmulqdq      $0x11,%xmm6,%xmm14,%xmm1
  497         vmovdqu 160-64(%rsi),%xmm6
  498         vpxor   %xmm12,%xmm15,%xmm15
  499         vpclmulqdq      $0x10,%xmm7,%xmm9,%xmm2
  500         vpxor   %xmm10,%xmm15,%xmm15
  501 
  502         leaq    128(%rdx),%rdx
  503         subq    $0x80,%rcx
  504         jnc     .Loop8x_avx
  505 
  506         addq    $0x80,%rcx
  507         jmp     .Ltail_no_xor_avx
  508 
  509 .balign 32
  510 .Lshort_avx:
  511         vmovdqu -16(%rdx,%rcx,1),%xmm14
  512         leaq    (%rdx,%rcx,1),%rdx
  513         vmovdqu 0-64(%rsi),%xmm6
  514         vmovdqu 32-64(%rsi),%xmm7
  515         vpshufb %xmm13,%xmm14,%xmm15
  516 
  517         vmovdqa %xmm0,%xmm3
  518         vmovdqa %xmm1,%xmm4
  519         vmovdqa %xmm2,%xmm5
  520         subq    $0x10,%rcx
  521         jz      .Ltail_avx
  522 
  523         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  524         vpxor   %xmm0,%xmm3,%xmm3
  525         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm0
  526         vpxor   %xmm15,%xmm8,%xmm8
  527         vmovdqu -32(%rdx),%xmm14
  528         vpxor   %xmm1,%xmm4,%xmm4
  529         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm1
  530         vmovdqu 16-64(%rsi),%xmm6
  531         vpshufb %xmm13,%xmm14,%xmm15
  532         vpxor   %xmm2,%xmm5,%xmm5
  533         vpclmulqdq      $0x00,%xmm7,%xmm8,%xmm2
  534         vpsrldq $8,%xmm7,%xmm7
  535         subq    $0x10,%rcx
  536         jz      .Ltail_avx
  537 
  538         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  539         vpxor   %xmm0,%xmm3,%xmm3
  540         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm0
  541         vpxor   %xmm15,%xmm8,%xmm8
  542         vmovdqu -48(%rdx),%xmm14
  543         vpxor   %xmm1,%xmm4,%xmm4
  544         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm1
  545         vmovdqu 48-64(%rsi),%xmm6
  546         vpshufb %xmm13,%xmm14,%xmm15
  547         vpxor   %xmm2,%xmm5,%xmm5
  548         vpclmulqdq      $0x00,%xmm7,%xmm8,%xmm2
  549         vmovdqu 80-64(%rsi),%xmm7
  550         subq    $0x10,%rcx
  551         jz      .Ltail_avx
  552 
  553         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  554         vpxor   %xmm0,%xmm3,%xmm3
  555         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm0
  556         vpxor   %xmm15,%xmm8,%xmm8
  557         vmovdqu -64(%rdx),%xmm14
  558         vpxor   %xmm1,%xmm4,%xmm4
  559         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm1
  560         vmovdqu 64-64(%rsi),%xmm6
  561         vpshufb %xmm13,%xmm14,%xmm15
  562         vpxor   %xmm2,%xmm5,%xmm5
  563         vpclmulqdq      $0x00,%xmm7,%xmm8,%xmm2
  564         vpsrldq $8,%xmm7,%xmm7
  565         subq    $0x10,%rcx
  566         jz      .Ltail_avx
  567 
  568         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  569         vpxor   %xmm0,%xmm3,%xmm3
  570         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm0
  571         vpxor   %xmm15,%xmm8,%xmm8
  572         vmovdqu -80(%rdx),%xmm14
  573         vpxor   %xmm1,%xmm4,%xmm4
  574         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm1
  575         vmovdqu 96-64(%rsi),%xmm6
  576         vpshufb %xmm13,%xmm14,%xmm15
  577         vpxor   %xmm2,%xmm5,%xmm5
  578         vpclmulqdq      $0x00,%xmm7,%xmm8,%xmm2
  579         vmovdqu 128-64(%rsi),%xmm7
  580         subq    $0x10,%rcx
  581         jz      .Ltail_avx
  582 
  583         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  584         vpxor   %xmm0,%xmm3,%xmm3
  585         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm0
  586         vpxor   %xmm15,%xmm8,%xmm8
  587         vmovdqu -96(%rdx),%xmm14
  588         vpxor   %xmm1,%xmm4,%xmm4
  589         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm1
  590         vmovdqu 112-64(%rsi),%xmm6
  591         vpshufb %xmm13,%xmm14,%xmm15
  592         vpxor   %xmm2,%xmm5,%xmm5
  593         vpclmulqdq      $0x00,%xmm7,%xmm8,%xmm2
  594         vpsrldq $8,%xmm7,%xmm7
  595         subq    $0x10,%rcx
  596         jz      .Ltail_avx
  597 
  598         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  599         vpxor   %xmm0,%xmm3,%xmm3
  600         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm0
  601         vpxor   %xmm15,%xmm8,%xmm8
  602         vmovdqu -112(%rdx),%xmm14
  603         vpxor   %xmm1,%xmm4,%xmm4
  604         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm1
  605         vmovdqu 144-64(%rsi),%xmm6
  606         vpshufb %xmm13,%xmm14,%xmm15
  607         vpxor   %xmm2,%xmm5,%xmm5
  608         vpclmulqdq      $0x00,%xmm7,%xmm8,%xmm2
  609         vmovq   184-64(%rsi),%xmm7
  610         subq    $0x10,%rcx
  611         jmp     .Ltail_avx
  612 
  613 .balign 32
  614 .Ltail_avx:
  615         vpxor   %xmm10,%xmm15,%xmm15
  616 .Ltail_no_xor_avx:
  617         vpunpckhqdq     %xmm15,%xmm15,%xmm8
  618         vpxor   %xmm0,%xmm3,%xmm3
  619         vpclmulqdq      $0x00,%xmm6,%xmm15,%xmm0
  620         vpxor   %xmm15,%xmm8,%xmm8
  621         vpxor   %xmm1,%xmm4,%xmm4
  622         vpclmulqdq      $0x11,%xmm6,%xmm15,%xmm1
  623         vpxor   %xmm2,%xmm5,%xmm5
  624         vpclmulqdq      $0x00,%xmm7,%xmm8,%xmm2
  625 
  626         vmovdqu (%r10),%xmm12
  627 
  628         vpxor   %xmm0,%xmm3,%xmm10
  629         vpxor   %xmm1,%xmm4,%xmm11
  630         vpxor   %xmm2,%xmm5,%xmm5
  631 
  632         vpxor   %xmm10,%xmm5,%xmm5
  633         vpxor   %xmm11,%xmm5,%xmm5
  634         vpslldq $8,%xmm5,%xmm9
  635         vpsrldq $8,%xmm5,%xmm5
  636         vpxor   %xmm9,%xmm10,%xmm10
  637         vpxor   %xmm5,%xmm11,%xmm11
  638 
  639         vpclmulqdq      $0x10,%xmm12,%xmm10,%xmm9
  640         vpalignr        $8,%xmm10,%xmm10,%xmm10
  641         vpxor   %xmm9,%xmm10,%xmm10
  642 
  643         vpclmulqdq      $0x10,%xmm12,%xmm10,%xmm9
  644         vpalignr        $8,%xmm10,%xmm10,%xmm10
  645         vpxor   %xmm11,%xmm10,%xmm10
  646         vpxor   %xmm9,%xmm10,%xmm10
  647 
  648         cmpq    $0,%rcx
  649         jne     .Lshort_avx
  650 
  651         vpshufb %xmm13,%xmm10,%xmm10
  652         vmovdqu %xmm10,(%rdi)
  653         vzeroupper
  654         RET
  655 .cfi_endproc
  656 SET_SIZE(gcm_ghash_avx)
  657 
  658 #endif /* !_WIN32 || _KERNEL */
  659 
  660 SECTION_STATIC
  661 .balign 64
  662 .Lbswap_mask:
  663 .byte   15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
  664 .L0x1c2_polynomial:
  665 .byte   1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2
  666 .L7_mask:
  667 .long   7,0,7,0
  668 .L7_mask_poly:
  669 .long   7,0,450,0
  670 .balign 64
  671 SET_OBJ(.Lrem_4bit)
  672 .Lrem_4bit:
  673 .long   0,0,0,471859200,0,943718400,0,610271232
  674 .long   0,1887436800,0,1822425088,0,1220542464,0,1423966208
  675 .long   0,3774873600,0,4246732800,0,3644850176,0,3311403008
  676 .long   0,2441084928,0,2376073216,0,2847932416,0,3051356160
  677 SET_OBJ(.Lrem_8bit)
  678 .Lrem_8bit:
  679 .value  0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E
  680 .value  0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E
  681 .value  0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E
  682 .value  0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E
  683 .value  0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E
  684 .value  0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E
  685 .value  0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E
  686 .value  0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E
  687 .value  0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE
  688 .value  0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE
  689 .value  0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE
  690 .value  0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE
  691 .value  0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E
  692 .value  0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E
  693 .value  0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE
  694 .value  0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE
  695 .value  0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E
  696 .value  0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E
  697 .value  0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E
  698 .value  0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E
  699 .value  0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E
  700 .value  0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E
  701 .value  0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E
  702 .value  0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E
  703 .value  0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE
  704 .value  0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE
  705 .value  0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE
  706 .value  0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE
  707 .value  0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E
  708 .value  0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E
  709 .value  0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE
  710 .value  0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE
  711 
  712 .byte   71,72,65,83,72,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
  713 .balign 64
  714 
  715 /* Mark the stack non-executable. */
  716 #if defined(__linux__) && defined(__ELF__)
  717 .section .note.GNU-stack,"",%progbits
  718 #endif
  719 
  720 #endif /* defined(__x86_64__) && defined(HAVE_AVX) && defined(HAVE_AES) ... */
