won't fix: 2020/11

2020-11-30

Apple M1のSHA512命令

新しいARM Crypto Extensionだと、中国系のSM3とかSHA3とかSHA512とかの専用命令があるのだけど、Appleは比較的こういうところに投資をしてるので、Appleのチップだと実装されている。

パフォーマンス的な情報はインターネットに転がっていないので、試しに手元で実装してみた。データはいつもの通りのNSSでのデータ。

実装前のデータはこれ。

#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
sha512_e         1Gb     15M       0       0.000   10000.000      10.000       168Mb

実装するとこうなる

#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
sha512_e         4Gb     47M       0       0.000   10000.000      10.000       503Mb

SHA1、SHA256と同様に3倍くらい速くなる感じですね。

SHA256とは違って、ROUNDに対するレジスタが足りないので、extを使って、利用するベクタを選択しながらループしないといけないのでシンプルにはならない

#define ROUND(n, a, b, c, d, e, f, g, h, w0, w1, w2, w3, w4)              \
    {                                                                     \
        uint64x2_t t, fg, de;                                             \
        t = vaddq_u64(a, vld1q_u64(K512 + n * 2));                        \
        t = vreinterpretq_u64_u8(vextq_u8(vreinterpretq_u8_u64(t),        \
                                          vreinterpretq_u8_u64(t), 8));   \
        de = vreinterpretq_u64_u8(vextq_u8(vreinterpretq_u8_u64(w1),      \
                                           vreinterpretq_u8_u64(w2), 8)); \
        fg = vreinterpretq_u64_u8(vextq_u8(vreinterpretq_u8_u64(w2),      \
                                           vreinterpretq_u8_u64(w3), 8)); \
        w3 = vaddq_u64(w3, t);                                            \
        w3 = vsha512hq_u64(w3, fg, de);                                   \
        w4 = vaddq_u64(w1, w3);                                           \
        w3 = vsha512h2q_u64(w3, w1, w0);                                  \
        if (n < 32) {                                                     \
            a = vsha512su0q_u64(a, b);                                    \
            a = vsha512su1q_u64(a, h,                                     \
                                vextq_u8(vreinterpretq_u8_u64(e),         \
                                         vreinterpretq_u8_u64(f), 8));    \
        }                                                                 \
    }

なお、SHA512のIntrinsicsはgccの最新であれば実装されているが、clangの場合は最新のコードでも実装されていないので、インラインアセンブラ使うなりアセンブラで書かないといけない

2020-11-24

Apple M1におけるARM Crypto Extensionのベンチマーク

MacBook Airを入手したので。NSS (https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSS) に含まれるbltestでのベンチデータ。比較対象にAWSのm6g.mediumのデータを置いておく

Apple M1 with ARM Crypto Extension (xcode's clang)

#     mode          in symmkey  opreps  cxreps     context          op   time(sec)     thrgput
 aes_ecb_e        24Gb     256      1B       0       0.000   10000.000      10.001         2Gb
 aes_ecb_d        23Gb     256      1B       0       0.000   10000.000      10.000         2Gb
 aes_cbc_e         7Gb     256    532M       0       0.000   10000.000      10.000       812Mb
 aes_cbc_d        24Gb     256      1B       0       0.000   10000.000      10.001         2Gb

#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
    sha1_e         8Gb    282M       0       0.000   10000.000      10.000       861Mb
  sha256_e         8Gb    155M       0       0.000   10000.000      10.000       828Mb

AWS m6g.medium with ARM Crypto Extension (Ubuntu's clang 10)

#     mode          in symmkey  opreps  cxreps     context          op   time(sec)     thrgput
 aes_ecb_e         8Gb     256    550M       0       0.000   10000.000      10.000       840Mb
 aes_ecb_d         7Gb     256    502M       0       0.000   10000.000      10.000       766Mb
 aes_cbc_e         6Gb     256    465M       0       0.000   10000.000      10.000       710Mb
 aes_cbc_d         6Gb     256    442M       0       0.000   10002.000      10.002       674Mb

#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
    sha1_e         4Gb    141M       0       0.000   10000.000      10.000       430Mb
  sha256_e         3Gb     66M       0       0.000   10000.000      10.000       354Mb

AES-CBCモードのEncryptionだけなぜか遅いという結果。ECBモードだと変わらないので、clangが生成したコードがApple M1のパイプラインでペナルティが発生するような状態と推測される。

なお、ARM Crypto Extensionを無効にすることもできるので、その場合。

Apple M1 without ARM Crypto Extension

#     mode          in symmkey  opreps  cxreps     context          op   time(sec)     thrgput
 aes_ecb_e         2Gb     256    155M       0       0.000   10000.000      10.000       236Mb
 aes_ecb_d         2Gb     256    143M       0       0.000   10000.000      10.001       218Mb
 aes_cbc_e         1Gb     256    133M       0       0.000   10000.000      10.000       203Mb
 aes_cbc_d         1Gb     256    130M       0       0.000   10000.000      10.000       199Mb

#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
    sha1_e         3Gb    108M       0       0.000   10000.000      10.000       331Mb
  sha256_e         1Gb     21M       0       0.000   10000.000      10.000       113Mb

AWS m6g.medium without ARM Crypto Extension

#     mode          in symmkey  opreps  cxreps     context          op   time(sec)     thrgput
 aes_ecb_e         1Gb     256     82M       0       0.000   10000.000      10.000       126Mb
 aes_ecb_d         1Gb     256     80M       0       0.000   10000.000      10.000       122Mb
 aes_cbc_e         1Gb     256     81M       0       0.000   10000.000      10.000       124Mb
 aes_cbc_d         1Gb     256     73M       0       0.000   10000.000      10.000       112Mb

#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
    sha1_e         1Gb     45M       0       0.000   10000.000      10.000       139Mb
  sha256_e       872Mb     16M       0       0.000   10000.000      10.000        87Mb