admin管理员组

文章数量:1349744

I have an executable generated inside composable_kernel. Observing the Makefile output, I see the exact command that generates the executable to be:

/opt/rocm/bin/hipcc -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TIME_KERNEL=1 -DCK_USE_FNUZ_FP8 -DCK_USE_GFX94 -DCK_USE_XDL -DDL_KERNELS -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -I/home/amysoren/composable_kernel/library/include -I/home/amysoren/composable_kernel/include -I/home/amysoren/composable_kernel/build/include -O3 -DNDEBUG -std=gnu++17 --offload-arch=gfx942 -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Werror -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Werror -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fno-offload-uniform-block -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-coerce-illegal-types=1 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -Werror -Weverything -fcolor-diagnostics  -o source.cpp.o -x hip -c /home/amysoren/composable_kernel/example/04_gemm_add_add_fastgelu/gemm_add_add_fastgelu_xdl_int8.cpp
/opt/rocm-6.3.0/lib/llvm/bin/clang++ --driver-mode=g++ --hip-link  -O3 -DNDEBUG --hip-link --rtlib=compiler-rt -unwindlib=libgcc CMakeFiles/example_gemm_add_add_fastgelu_xdl_int8.dir/gemm_add_add_fastgelu_xdl_int8.cpp.o -o "../../bin/example_gemm_add_add_fastgelu_xdl_int8" ../../lib/libutility.a /opt/rocm/lib/libamdhip64.so.6.3.60300 /opt/rocm-6.3.0/lib/llvm/lib/clang/18/lib/linux/libclang_rt.builtins-x86_64.a /opt/rocm-6.3.0/lib/libamdhip64.so.6.3.60300

When I run with HIPCC_VERBOSE, I see the exact set of commands to be:

# generate device-side object                                                                                                                                                                                                    
/opt/rocm-6.3.0/lib/llvm/bin/clang-18 -cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-unknown-linux-gnu -emit-obj -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name gemm_add_add_fastgelu_xdl_int8.cpp -mrelocation-model pic -pic-level 2 -fhalf-no-semantic-interposition -mframe-pointer=none -fno-rounding-math -mconstructor-aliases -aux-target-cpu x86-64 -fcuda-is-device -mllvm -amdgpu-internalize-symbols -fcuda-allow-variadic-functions -fvisibility=hidden -fapply-global-visibility-to-externs -mlink-builtin-bitcode /opt/rocm-6.3.0/lib/llvm/lib/clang/18/lib/amdgcn/bitcode/hip.bc -mlink-builtin-bitcode /opt/rocm-6.3.0/lib/llvm/lib/clang/18/lib/amdgcn/bitcode/ocml.bc -mlink-builtin-bitcode /opt/rocm-6.3.0/lib/llvm/lib/clang/18/lib/amdgcn/bitcode/ockl.bc -mlink-builtin-bitcode /opt/rocm-6.3.0/lib/llvm/lib/clang/18/lib/amdgcn/bitcode/oclc_daz_opt_off.bc -mlink-builtin-bitcode /opt/rocm-6.3.0/lib/llvm/lib/clang/18/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc -mlink-builtin-bitcode /opt/rocm-6.3.0/lib/llvm/lib/clang/18/lib/amdgcn/bitcode/oclc_finite_only_off.bc -mlink-builtin-bitcode /opt/rocm-6.3.0/lib/llvm/lib/clang/18/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc -mlink-builtin-bitcode /opt/rocm-6.3.0/lib/llvm/lib/clang/18/lib/amdgcn/bitcode/oclc_wavefrontsize64_on.bc -mlink-builtin-bitcode /opt/rocm-6.3.0/lib/llvm/lib/clang/18/lib/amdgcn/bitcode/oclc_isa_version_942.bc -mlink-builtin-bitcode /opt/rocm-6.3.0/lib/llvm/lib/clang/18/lib/amdgcn/bitcode/oclc_abi_version_500.bc -target-cpu gfx942 -debugger-tuning=gdb -fdebug-compilation-dir=/home/amysoren/composable_kernel/build -resource-dir /opt/rocm-6.3.0/lib/llvm/lib/clang/18 -internal-isystem /opt/rocm-6.3.0/lib/llvm/lib/clang/18/include/cuda_wrappers -idirafter /opt/rocm-6.3.0/lib/llvm/bin/../../../include -include __clang_hip_runtime_wrapper.h -D CK_ENABLE_BF16 -D CK_ENABLE_BF8 -D CK_ENABLE_FP16 -D CK_ENABLE_FP32 -D CK_ENABLE_FP64 -D CK_ENABLE_FP8 -D CK_ENABLE_INT8 -D CK_TIME_KERNEL=1 -D CK_USE_FNUZ_FP8 -D CK_USE_GFX94 -D CK_USE_XDL -D DL_KERNELS -D DPP_KERNELS -D USE_PROF_API=1 -D __HIP_PLATFORM_AMD__=1 -D __HIP_PLATFORM_HCC__=1 -D __HIP_ROCclr__=1 -I /home/amysoren/composable_kernel/library/include -I /home/amysoren/composable_kernel/include -I /home/amysoren/composable_kernel/build/include -D NDEBUG -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/backward -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/backward -internal-isystem /opt/rocm-6.3.0/lib/llvm/lib/clang/18/include -internal-isystem /usr/local/include -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../x86_64-linux-gnu/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /opt/rocm-6.3.0/lib/llvm/lib/clang/18/include -internal-isystem /usr/local/include -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../x86_64-linux-gnu/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Werror -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Werror -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -Werror -Weverything -std=gnu++17 -fdeprecated-macro -fno-autolink -ferror-limit 19 -fhip-new-launch-api -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -fcolor-diagnostics -vectorize-loops -vectorize-slp -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-coerce-illegal-types=1 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -cuid=e9b00f5e58068913 -fcuda-allow-variadic-functions -fno-offload-uniform-block -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o device.o -x hip source.cpp      
# replace above -emit-obj with -S to get device-side disassembly (device.s)
/opt/rocm-6.3.0/lib/llvm/bin/lld -flavor gnu -m elf64_amdgpu --no-undefined -shared -plugin-opt=-amdgpu-internalize-symbols -plugin-opt=mcpu=gfx942 -plugin-opt=O3 --lto-CGO3 --whole-archive -o device.out device.o --no-whole-archive
/opt/rocm-6.3.0/lib/llvm/bin/clang-offload-bundler -type=o -bundle-align=4096 -targets=host-x86_64-unknown-linux,hipv4-amdgcn-amd-amdhsa--gfx942 -input=/dev/null -input=device.out -output=device.hipfb
 
# generate host-side object and combine with device.hipfb
/opt/rocm-6.3.0/lib/llvm/bin/clang-18 -cc1 -triple x86_64-unknown-linux-gnu -aux-triple amdgcn-amd-amdhsa -emit-obj -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name gemm_add_add_fastgelu_xdl_int8.cpp -mrelocation-model static -mframe-pointer=none -fmath-errno -fno-rounding-math -mconstructor-aliases -funwind-tables=2 -target-cpu x86-64 -tune-cpu generic -debugger-tuning=gdb -fdebug-compilation-dir=/home/amysoren/composable_kernel/build -fcoverage-compilation-dir=/home/amysoren/composable_kernel/build -resource-dir /opt/rocm-6.3.0/lib/llvm/lib/clang/18 -internal-isystem /opt/rocm-6.3.0/lib/llvm/lib/clang/18/include/cuda_wrappers -idirafter /opt/rocm-6.3.0/lib/llvm/bin/../../../include -include __clang_hip_runtime_wrapper.h -D CK_ENABLE_BF16 -D CK_ENABLE_BF8 -D CK_ENABLE_FP16 -D CK_ENABLE_FP32 -D CK_ENABLE_FP64 -D CK_ENABLE_FP8 -D CK_ENABLE_INT8 -D CK_TIME_KERNEL=1 -D CK_USE_FNUZ_FP8 -D CK_USE_GFX94 -D CK_USE_XDL -D DL_KERNELS -D DPP_KERNELS -D USE_PROF_API=1 -D __HIP_PLATFORM_AMD__=1 -D __HIP_PLATFORM_HCC__=1 -D __HIP_ROCclr__=1 -I /home/amysoren/composable_kernel/library/include -I /home/amysoren/composable_kernel/include -I /home/amysoren/composable_kernel/build/include -D NDEBUG -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/backward -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/backward -internal-isystem /opt/rocm-6.3.0/lib/llvm/lib/clang/18/include -internal-isystem /usr/local/include -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../x86_64-linux-gnu/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /opt/rocm-6.3.0/lib/llvm/lib/clang/18/include -internal-isystem /usr/local/include -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../x86_64-linux-gnu/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Werror -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Werror -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -Werror -Weverything -std=gnu++17 -fdeprecated-macro -ferror-limit 19 -fhip-new-launch-api -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -fcolor-diagnostics -vectorize-loops -vectorize-slp -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-coerce-illegal-types=1 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -fcuda-include-gpubinary device.hipfb -cuid=e9b00f5e58068913 -fcuda-allow-variadic-functions -fno-offload-uniform-block -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o final.cpp.o -x hip source.cpp
# generate the final executable 
/opt/rocm-6.3.0/lib/llvm/bin/clang++ --driver-mode=g++ --hip-link  -O3 -DNDEBUG --hip-link --rtlib=compiler-rt -unwindlib=libgcc final.cpp.o -o final lib/libutility.a /opt/rocm/lib/libamdhip64.so.6.3.60300 /opt/rocm-6.3.0/lib/llvm/lib/clang/18/lib/linux/libclang_rt.builtins-x86_64.a /opt/rocm-6.3.0/lib/libamdhip64.so.6.3.60300

Now I'm try to modify the device.s obtained transiently, and recombine with the final executable with the host-side code unchanged. The modification is simple register renaming.

/opt/rocm/bin/hipcc -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TIME_KERNEL=1 -DCK_USE_FNUZ_FP8 -DCK_USE_GFX94 -DCK_USE_XDL -DDL_KERNELS -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -I/home/amysoren/composable_kernel/library/include -I/home/amysoren/composable_kernel/include -I/home/amysoren/composable_kernel/build/include -O3 -DNDEBUG -std=gnu++17 --offload-arch=gfx942 -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Werror -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Werror -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fno-offload-uniform-block -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-coerce-illegal-types=1 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -Werror -Weverything -fcolor-diagnostics  -o device.o  -c -target amdgcn-amd-amdhsa -mcpu=gfx942 device.s
# follow rest of the previous steps

And when I run the final executable I encounter:

terminate called after throwing an instance of 'std::runtime_error'
  what():  HIP runtime error: no kernel image is available for execution on the device. hip_check_error.hpp: 17in function: hip_check_error
Aborted (core dumped)

Is there something I'm missing?

本文标签: amd rocmRecompiling a unified executable with modified disassemblyStack Overflow