A collection of weights I've trained comparing various types of SE-like blocks (SE, ECA, GC, etc.), self-attention blocks (bottleneck, halo, lambda), and related non-attn baselines.
ResNet-26-T series
- [2, 2, 2, 2] repeat Bottleneck block ResNet architecture
- ReLU activations
- 3 layer stem with 24, 32, 64 chs, max-pool
- avg pool in shortcut downsample
- self-attn blocks replace the 3x3 conv in both blocks of the last stage, and in the second block of the penultimate stage
model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
---|---|---|---|---|---|---|---|---|
botnet26t_256 | 79.246 | 20.754 | 94.53 | 5.47 | 12.49 | 256 | 0.95 | bicubic |
halonet26t | 79.13 | 20.87 | 94.314 | 5.686 | 12.48 | 256 | 0.95 | bicubic |
lambda_resnet26t | 79.112 | 20.888 | 94.59 | 5.41 | 10.96 | 256 | 0.94 | bicubic |
lambda_resnet26rpt_256 | 78.964 | 21.036 | 94.428 | 5.572 | 10.99 | 256 | 0.94 | bicubic |
resnet26t | 77.872 | 22.128 | 93.834 | 6.166 | 16.01 | 256 | 0.94 | bicubic |
Details:
- HaloNet - 8 pixel block size, 2 pixel halo (overlap), relative position embedding
- BotNet - relative position embedding
- Lambda-ResNet-26-T - 3d lambda conv, kernel = 9
- Lambda-ResNet-26-RPT - relative position embedding
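These weights ship with timm; below is a minimal sketch of loading one of the models above and building its eval transform (the img_size / crop_pct / interpolation in the table come from each model's default config), assuming a recent timm install:

```python
import torch
import timm
from timm.data import resolve_data_config, create_transform

# any model name from the table above, e.g. halonet26t
model = timm.create_model('halonet26t', pretrained=True).eval()

# eval preprocessing (img_size, crop_pct, interpolation) from the model's default config
cfg = resolve_data_config({}, model=model)
transform = create_transform(**cfg)
print(cfg)

with torch.no_grad():
    x = torch.randn(1, 3, 256, 256)  # stand-in for a transformed image batch
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000])
```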
Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09
model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
---|---|---|---|---|---|---|---|---|---|
resnet26t | 2967.55 | 86.252 | 256 | 256 | 857.62 | 297.984 | 256 | 256 | 16.01 |
botnet26t_256 | 2642.08 | 96.879 | 256 | 256 | 809.41 | 315.706 | 256 | 256 | 12.49 |
halonet26t | 2601.91 | 98.375 | 256 | 256 | 783.92 | 325.976 | 256 | 256 | 12.48 |
lambda_resnet26t | 2354.1 | 108.732 | 256 | 256 | 697.28 | 366.521 | 256 | 256 | 10.96 |
lambda_resnet26rpt_256 | 1847.34 | 138.563 | 256 | 256 | 644.84 | 197.892 | 128 | 256 | 10.99 |
Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09
model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
---|---|---|---|---|---|---|---|---|---|
resnet26t | 3691.94 | 69.327 | 256 | 256 | 1188.17 | 214.96 | 256 | 256 | 16.01 |
botnet26t_256 | 3291.63 | 77.76 | 256 | 256 | 1126.68 | 226.653 | 256 | 256 | 12.49 |
halonet26t | 3230.5 | 79.232 | 256 | 256 | 1077.82 | 236.934 | 256 | 256 | 12.48 |
lambda_resnet26rpt_256 | 2324.15 | 110.133 | 256 | 256 | 864.42 | 147.485 | 128 | 256 | 10.99 |
lambda_resnet26t | Not Supported |
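The benchmark numbers in these tables come from timm's benchmark script on an RTX 3090. A rough stand-in for the inference half of that setup (AMP autocast, optional channels-last / NHWC memory format) is sketched below; absolute numbers will differ with GPU, driver, and library versions:

```python
import time
import torch
import timm

def bench_infer(name, batch=256, size=256, channels_last=False, steps=50):
    model = timm.create_model(name).cuda().eval()
    x = torch.randn(batch, 3, size, size, device='cuda')
    if channels_last:  # NHWC memory format
        model = model.to(memory_format=torch.channels_last)
        x = x.contiguous(memory_format=torch.channels_last)
    with torch.no_grad(), torch.cuda.amp.autocast():
        for _ in range(10):  # warmup
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(steps):
            model(x)
        torch.cuda.synchronize()
    return batch * steps / (time.time() - start)  # samples/sec

print(bench_infer('resnet26t', channels_last=True))
```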
ResNeXt-26-T series
- [2, 2, 2, 2] repeat Bottleneck block ResNeXt architecture
- SiLU activations
- grouped 3x3 convolutions in bottleneck, 32 channels per group
- 3 layer stem with 24, 32, 64 chs, max-pool
- avg pool in shortcut downsample
- channel attn (active in non self-attn blocks) between the 3x3 and last 1x1 conv (see the ECA sketch after this list)
- when active, self-attn blocks replace the 3x3 conv in both blocks of the last stage, and in the second block of the penultimate stage
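As a reference for what the eca_* variants are doing, here is a minimal ECA-style channel attention module (global average pool, small 1D conv across channels, sigmoid gate); the actual timm implementation differs in details such as adaptive kernel sizing. It would slot in after the grouped 3x3 conv + activation and before the final 1x1 of each bottleneck:

```python
import torch.nn as nn

class ECA(nn.Module):
    """Minimal ECA-style channel attention: gate channels via a 1D conv over the pooled descriptor."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                               # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                          # global average pool -> (B, C)
        y = self.conv(y.unsqueeze(1))                   # 1D conv across channels -> (B, 1, C)
        y = y.sigmoid().transpose(1, 2).unsqueeze(-1)   # -> (B, C, 1, 1) per-channel gate
        return x * y                                    # rescale channels
```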
model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
---|---|---|---|---|---|---|---|---|
eca_halonext26ts | 79.484 | 20.516 | 94.600 | 5.400 | 10.76 | 256 | 0.94 | bicubic |
eca_botnext26ts_256 | 79.270 | 20.730 | 94.594 | 5.406 | 10.59 | 256 | 0.95 | bicubic |
bat_resnext26ts | 78.268 | 21.732 | 94.1 | 5.9 | 10.73 | 256 | 0.9 | bicubic |
seresnext26ts | 77.852 | 22.148 | 93.784 | 6.216 | 10.39 | 256 | 0.9 | bicubic |
gcresnext26ts | 77.804 | 22.196 | 93.824 | 6.176 | 10.48 | 256 | 0.9 | bicubic |
eca_resnext26ts | 77.446 | 22.554 | 93.57 | 6.43 | 10.3 | 256 | 0.9 | bicubic |
resnext26ts | 76.764 | 23.236 | 93.136 | 6.864 | 10.3 | 256 | 0.9 | bicubic |
Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09
model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
---|---|---|---|---|---|---|---|---|---|
resnext26ts | 3006.57 | 85.134 | 256 | 256 | 864.4 | 295.646 | 256 | 256 | 10.3 |
seresnext26ts | 2931.27 | 87.321 | 256 | 256 | 836.92 | 305.193 | 256 | 256 | 10.39 |
eca_resnext26ts | 2925.47 | 87.495 | 256 | 256 | 837.78 | 305.003 | 256 | 256 | 10.3 |
gcresnext26ts | 2870.01 | 89.186 | 256 | 256 | 818.35 | 311.97 | 256 | 256 | 10.48 |
eca_botnext26ts_256 | 2652.03 | 96.513 | 256 | 256 | 790.43 | 323.257 | 256 | 256 | 10.59 |
eca_halonext26ts | 2593.03 | 98.705 | 256 | 256 | 766.07 | 333.541 | 256 | 256 | 10.76 |
bat_resnext26ts | 2469.78 | 103.64 | 256 | 256 | 697.21 | 365.964 | 256 | 256 | 10.73 |
Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09
NOTE: there are performance issues with certain grouped conv configs in channels-last layout; the backward pass in particular is very slow. This is also causing issues for RegNet and NFNet networks.
model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
---|---|---|---|---|---|---|---|---|---|
resnext26ts | 3952.37 | 64.755 | 256 | 256 | 608.67 | 420.049 | 256 | 256 | 10.3 |
eca_resnext26ts | 3815.77 | 67.074 | 256 | 256 | 594.35 | 430.146 | 256 | 256 | 10.3 |
seresnext26ts | 3802.75 | 67.304 | 256 | 256 | 592.82 | 431.14 | 256 | 256 | 10.39 |
gcresnext26ts | 3626.97 | 70.57 | 256 | 256 | 581.83 | 439.119 | 256 | 256 | 10.48 |
eca_botnext26ts_256 | 3515.84 | 72.8 | 256 | 256 | 611.71 | 417.862 | 256 | 256 | 10.59 |
eca_halonext26ts | 3410.12 | 75.057 | 256 | 256 | 597.52 | 427.789 | 256 | 256 | 10.76 |
bat_resnext26ts | 3053.83 | 83.811 | 256 | 256 | 533.23 | 478.839 | 256 | 256 | 10.73 |
ResNet-33-T series
- [2, 3, 3, 2] repeat Bottleneck block ResNet architecture
- SiLU activations
- 3 layer stem with 24, 32, 64 chs, no max-pool, 1st and 3rd conv stride 2
- avg pool in shortcut downsample
- channel attn (active in non self-attn blocks) between 3x3 and last 1x1 conv
- when active, self-attn blocks replace the 3x3 conv in the last block of stages 2 and 3, and in both blocks of the final stage
- FC 1x1 conv between last block and classifier
The 33-layer models have an extra 1x1 FC layer between the last conv block and the classifier. There are both a non-attention 33-layer baseline and a 32-layer baseline without the extra FC.
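A quick way to see the effect of that extra FC and the attention modules is to compare parameter counts directly against the param_count column below; a small sketch, assuming all of the listed model names are registered in timm:

```python
import timm

for name in ('resnet32ts', 'resnet33ts', 'seresnet33ts', 'sehalonet33ts'):
    model = timm.create_model(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f'{name}: {n_params / 1e6:.2f}M params')
```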
model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
---|---|---|---|---|---|---|---|---|
sehalonet33ts | 80.986 | 19.014 | 95.272 | 4.728 | 13.69 | 256 | 0.94 | bicubic |
seresnet33ts | 80.388 | 19.612 | 95.108 | 4.892 | 19.78 | 256 | 0.94 | bicubic |
eca_resnet33ts | 80.132 | 19.868 | 95.054 | 4.946 | 19.68 | 256 | 0.94 | bicubic |
gcresnet33ts | 79.99 | 20.01 | 94.988 | 5.012 | 19.88 | 256 | 0.94 | bicubic |
resnet33ts | 79.352 | 20.648 | 94.596 | 5.404 | 19.68 | 256 | 0.94 | bicubic |
resnet32ts | 79.028 | 20.972 | 94.444 | 5.556 | 17.96 | 256 | 0.94 | bicubic |
Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09
model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
---|---|---|---|---|---|---|---|---|---|
resnet32ts | 2502.96 | 102.266 | 256 | 256 | 733.27 | 348.507 | 256 | 256 | 17.96 |
resnet33ts | 2473.92 | 103.466 | 256 | 256 | 725.34 | 352.309 | 256 | 256 | 19.68 |
seresnet33ts | 2400.18 | 106.646 | 256 | 256 | 695.19 | 367.413 | 256 | 256 | 19.78 |
eca_resnet33ts | 2394.77 | 106.886 | 256 | 256 | 696.93 | 366.637 | 256 | 256 | 19.68 |
gcresnet33ts | 2342.81 | 109.257 | 256 | 256 | 678.22 | 376.404 | 256 | 256 | 19.88 |
sehalonet33ts | 1857.65 | 137.794 | 256 | 256 | 577.34 | 442.545 | 256 | 256 | 13.69 |
Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09
model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
---|---|---|---|---|---|---|---|---|---|
resnet32ts | 3306.22 | 77.416 | 256 | 256 | 1012.82 | 252.158 | 256 | 256 | 17.96 |
resnet33ts | 3257.59 | 78.573 | 256 | 256 | 1002.38 | 254.778 | 256 | 256 | 19.68 |
seresnet33ts | 3128.08 | 81.826 | 256 | 256 | 950.27 | 268.581 | 256 | 256 | 19.78 |
eca_resnet33ts | 3127.11 | 81.852 | 256 | 256 | 948.84 | 269.123 | 256 | 256 | 19.68 |
gcresnet33ts | 2984.87 | 85.753 | 256 | 256 | 916.98 | 278.169 | 256 | 256 | 19.88 |
sehalonet33ts | 2188.23 | 116.975 | 256 | 256 | 711.63 | 179.03 | 128 | 256 | 13.69 |
ResNet-50(ish) models
In Progress
RegNet"Z" series
- RegNetZ inspired architecture, inverted bottleneck, SE attention, pre-classifier FC, essentially an EfficientNet w/ grouped conv instead of depthwise
- b, c, and d are three different sizes I put together to cover differing flop ranges, not based on the paper (https://arxiv.org/abs/2103.06877) or a search process
- for comparison to RegNetY and the paper's RegNetZ models: at 224x224 the b, c, and d models are 1.45, 1.92, and 4.58 GMACs respectively; c and d are trained at 256 here so their GMACs are higher than that (see tables, and the counting sketch after this list)
- haloregnetz_b uses halo attention for all of the last stage, and interleaved every 3rd (of 4) block in the penultimate stage
- b and c variants use a stem / 1st stage like the paper; d uses a 3-deep tiered stem with 2-1-2 striding
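The GMAC figures quoted here (and in the benchmark tables below) can be reproduced approximately with a MAC counter; a sketch assuming fvcore is available (its FlopCountAnalysis totals are multiply-accumulates, matching the GMAC convention used here):

```python
import torch
import timm
from fvcore.nn import FlopCountAnalysis  # assumption: fvcore is installed

for name, size in [('regnetz_b', 224), ('regnetz_c', 256), ('regnetz_d', 256)]:
    model = timm.create_model(name).eval()
    x = torch.randn(1, 3, size, size)
    macs = FlopCountAnalysis(model, x).total()
    print(f'{name} @ {size}x{size}: {macs / 1e9:.2f} GMACs')
```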
ImageNet-1k validation at train resolution
model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
---|---|---|---|---|---|---|---|---|
regnetz_d | 83.422 | 16.578 | 96.636 | 3.364 | 27.58 | 256 | 0.95 | bicubic |
regnetz_c | 82.164 | 17.836 | 96.058 | 3.942 | 13.46 | 256 | 0.94 | bicubic |
haloregnetz_b | 81.058 | 18.942 | 95.2 | 4.8 | 11.68 | 224 | 0.94 | bicubic |
regnetz_b | 79.868 | 20.132 | 94.988 | 5.012 | 9.72 | 224 | 0.94 | bicubic |
ImageNet-1k validation at optimal test res
model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
---|---|---|---|---|---|---|---|---|
regnetz_d | 84.04 | 15.96 | 96.87 | 3.13 | 27.58 | 320 | 0.95 | bicubic |
regnetz_c | 82.516 | 17.484 | 96.356 | 3.644 | 13.46 | 320 | 0.94 | bicubic |
haloregnetz_b | 81.058 | 18.942 | 95.2 | 4.8 | 11.68 | 224 | 0.94 | bicubic |
regnetz_b | 80.728 | 19.272 | 95.47 | 4.53 | 9.72 | 288 | 0.94 | bicubic |
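The "optimal test res" results evaluate at a higher resolution than training; with timm this amounts to overriding the resolved data config before building the eval transform. A minimal sketch, using the regnetz_d test resolution from the table above:

```python
import timm
from timm.data import resolve_data_config, create_transform

model = timm.create_model('regnetz_d', pretrained=True).eval()
cfg = resolve_data_config({}, model=model)   # train-res config
cfg['input_size'] = (3, 320, 320)            # evaluate at the test res from the table above
transform = create_transform(**cfg)
print(transform)
```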
Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09
model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | infer_GMACs | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
---|---|---|---|---|---|---|---|---|---|---|
regnetz_b | 2703.42 | 94.68 | 256 | 224 | 1.45 | 764.85 | 333.348 | 256 | 224 | 9.72 |
haloregnetz_b | 2086.22 | 122.695 | 256 | 224 | 1.88 | 620.1 | 411.415 | 256 | 224 | 11.68 |
regnetz_c | 1653.19 | 154.836 | 256 | 256 | 2.51 | 459.41 | 277.268 | 128 | 256 | 13.46 |
regnetz_d | 1060.91 | 241.284 | 256 | 256 | 5.98 | 296.51 | 430.143 | 128 | 256 | 27.58 |
Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09
NOTE: channels-last layout is painfully slow for the backward pass here due to some sort of cuDNN issue
model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | infer_GMACs | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
---|---|---|---|---|---|---|---|---|---|---|
regnetz_b | 4152.59 | 61.634 | 256 | 224 | 1.45 | 399.37 | 639.572 | 256 | 224 | 9.72 |
haloregnetz_b | 2770.78 | 92.378 | 256 | 224 | 1.88 | 364.22 | 701.386 | 256 | 224 | 11.68 |
regnetz_c | 2512.4 | 101.878 | 256 | 256 | 2.51 | 376.72 | 338.372 | 128 | 256 | 13.46 |
regnetz_d | 1456.05 | 175.8 | 256 | 256 | 5.98 | 111.32 | 1148.279 | 128 | 256 | 27.58 |