Tensor Fusion Host + VM Deployment Guide
Run tensor-fusion-worker on the host and deploy tensor-fusion-client inside a virtual machine (VM) to use the host’s physical GPUs from the VM without GPU passthrough.
Terminology
- Tensor-fusion-worker (worker): A standalone binary. You can run multiple worker instances on the same host. It receives compute requests from clients and executes the tasks.
- Tensor-fusion-client (client): A set of shared libraries that fully export CUDA and NVML APIs. It runs inside your application environment and forwards compute requests to the worker on the host.
NOTE
Multiple clients can connect to one worker instance. For production, we recommend one client per worker instance for easier resource isolation and management.
Prerequisites
- A Linux host with NVIDIA GPUs. Recommended NVIDIA driver version: 570.xx or newer.
- One or more VMs (QEMU/MVisor/VMware/Hyper‑V, etc.) that can reach the host either over the network or via shared memory.
Step 1: Download Tensor Fusion
- Download the latest release: https://cdn.tensor-fusion.ai/archive-lastest.zip
- Unzip the archive:
unzip archive-lastest.zip
Step 2: Install tensor-fusion-worker on the host
Check the worker binary options:
./tensor-fusion-worker -h
Usage: tensor-fusion-worker [option]
Options
-h, --help Display this information.
-v, --version Display version information.
-n, --net Specified network protocol.
-p, --port Specified the port for server.
-l, --load Specified the file path of snapshot.
-m, --shmem-file Specified the file path of shared memory.
-M, --shmem-size Specified the size(MB) of shared memory.
NOTE
- The worker supports two protocols: NATIVE and SHMEM.
- NATIVE uses TCP and is simple to deploy; recommended when the VMs and the host are on the same network with a round-trip latency (ping) of roughly 0.8 ms or less (a quick check is shown below).
- SHMEM uses shared memory and offers the lowest latency; best for latency-sensitive workloads.
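For example, you can measure the round-trip latency from a VM to the host before choosing a protocol (192.168.1.100 is a placeholder for your host’s IP):
ping -c 10 192.168.1.100
If the average RTT is noticeably above ~0.8 ms, consider SHMEM instead.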
Start with NATIVE protocol on port 12345:
TF_ENABLE_LOG=1 ./tensor-fusion-worker -n native -p 12345
Start with SHMEM protocol, creating shared memory at /my_shm with size 256 MB:
TF_ENABLE_LOG=1 ./tensor-fusion-worker -n shmem -m /my_shm -M 256
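If you follow the one-client-per-worker recommendation above, you can start one worker instance per VM, each on its own port (the second port below is only an example):
TF_ENABLE_LOG=1 ./tensor-fusion-worker -n native -p 12345 &
TF_ENABLE_LOG=1 ./tensor-fusion-worker -n native -p 12346 &
Each VM then points its client at a different port via the connection info described in Steps 3 and 4.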
Environment variables
- TF_ENABLE_LOG: Enable logging (default: off)
- TF_LOG_LEVEL: trace|debug|info|warn|error (default: info)
- TF_LOG_PATH: Log file path (default: empty, i.e., stdout)
- TF_CUDA_MEMORY_LIMIT: CUDA memory limit in MB (default: unlimited)
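For example, a worker launch that writes debug logs to a file and caps CUDA memory at 8 GB could look like this (the log path and limit are illustrative):
TF_ENABLE_LOG=1 TF_LOG_LEVEL=debug TF_LOG_PATH=/var/log/tensor-fusion-worker.log TF_CUDA_MEMORY_LIMIT=8192 ./tensor-fusion-worker -n native -p 12345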
Step 3: Install tensor-fusion-client in the VM
- Get the tensor-fusion-client directory from the extracted package.
Windows
Place the files from tensor-fusion-client/windows (nvcuda.dll, nvml.dll, teleport.dll) either in a directory on the system PATH or alongside your application executable so they can be loaded.
Linux
Use LD_LIBRARY_PATH or LD_PRELOAD to inject the shared libraries from tensor-fusion-client/linux (libcuda.so, libnvidia-ml.so, libteleport.so) into your application process.
TIP
OS environments vary. Please ensure the client libraries are actually loaded by your application.
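On Linux, one way to confirm this is to check the process’s memory maps. A minimal sketch, assuming the client is installed under /opt/tensor-fusion-client:
export LD_PRELOAD=/opt/tensor-fusion-client/linux/libteleport.so:/opt/tensor-fusion-client/linux/libcuda.so:/opt/tensor-fusion-client/linux/libnvidia-ml.so
python3 -c "print(open('/proc/self/maps').read())" | grep -E 'teleport|libcuda|libnvidia-ml'
If all three libraries appear in the output, they are being loaded into the process.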
Environment variables
- TF_ENABLE_LOG: Enable logging (default: off)
- TF_LOG_LEVEL: trace|debug|info|warn|error (default: info)
- TF_LOG_PATH: Log file path (default: empty; on Linux logs go to the console, on Windows view them with DebugView)
- TENSOR_FUSION_OPERATOR_GET_CONNECTION_URL: An HTTP GET endpoint that returns connection info (optional)
- TENSOR_FUSION_OPERATOR_CONNECTION_INFO: Connection info in the format protocol+param1+param2+version (version is currently 0)
- TF_MAX_CACHE_REQUEST_COUNT: Maximum cache request count (default: 100)
TIP
You can set the connection info directly via TENSOR_FUSION_OPERATOR_CONNECTION_INFO, or provide a GET endpoint via TENSOR_FUSION_OPERATOR_GET_CONNECTION_URL that returns it.
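For example, assuming the endpoint simply returns the connection string in the same protocol+param1+param2+version format as plain text (the URL below is a placeholder):
export TENSOR_FUSION_OPERATOR_GET_CONNECTION_URL=http://192.168.1.100:8080/tensor-fusion/connection
curl -s "$TENSOR_FUSION_OPERATOR_GET_CONNECTION_URL"
The curl call is only a manual sanity check that the endpoint is reachable and returns something like native+192.168.1.100+12345+0.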
Step 4: Verify the setup
We’ll verify inside a Linux VM using Python with PyTorch and CUDA.
Prepare the environment
Enable logs and inject the client libraries
export TF_ENABLE_LOG=1
export LD_PRELOAD=/opt/tensor-fusion-client/linux/libteleport.so:/opt/tensor-fusion-client/linux/libcuda.so:/opt/tensor-fusion-client/linux/libnvidia-ml.so
NOTE
Assuming the client is installed under /opt/tensor-fusion-client. Adjust paths as needed.
NATIVE protocol
If the worker runs with NATIVE at host IP 192.168.1.100 and port 12345:
- Set the connection info (format: native+[host-ip]+[port]+[version]):
export TENSOR_FUSION_OPERATOR_CONNECTION_INFO=native+192.168.1.100+12345+0
TIP
Ensure the VM can reach the host IP and port.
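For example, from inside the VM (nc may need to be installed):
ping -c 3 192.168.1.100
nc -vz 192.168.1.100 12345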
SHMEM protocol
If the worker runs with SHMEM and creates /my_shm with a size of 256 MB, start your VM with QEMU and attach the shared memory via IVSHMEM:
QEMU command example
qemu-system-x86_64 \
-m 8192 \
-hda centos8.4.qcow2 \
-vnc :1 \
-enable-kvm \
-cpu host \
-object memory-backend-file,id=shm0,mem-path=/dev/shm/my_shm,size=256M,share=on \
-device ivshmem-plain,memdev=shm0 \
-smp 4
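You can also confirm on the host that the worker’s shared memory is visible at the path QEMU maps; this assumes the worker backs /my_shm with a file under /dev/shm:
ls -lh /dev/shm/my_shm
The reported size should match the 256 MB configured above.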
- Locate the IVSHMEM device inside the VM:
lspci -vv | grep -i Inter-VM
Possible output:
00:04.0 RAM memory: Red Hat, Inc. Inter-VM shared memory (rev 01)
Hence the device BDF is 00:04.0.
- Set the connection info (format: shmem+[resource]+[size]+[version]). Use the device’s resource file (e.g., resource2), a size in MB that matches the worker’s setting, and version 0:
export TENSOR_FUSION_OPERATOR_CONNECTION_INFO=shmem+/sys/devices/pci0000:00/0000:00:04.0/resource2+256+0
TIP
- The shared-memory size must match the worker’s setting (you can confirm the BAR size as shown below).
- Root privileges are typically required to mmap the IVSHMEM BAR resource.
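To confirm the BAR size inside the VM, inspect the IVSHMEM device’s memory regions; with ivshmem-plain the shared memory is exposed as BAR2 (Region 2) and should report a size of 256M here:
sudo lspci -vv -s 00:04.0 | grep -i region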
PyTorch validation example
Once the environment variables are set, run the following (PyTorch + Qwen3‑0.6B) in the VM:
pip install modelscope packaging transformers accelerate
cat << 'EOF' > test-qwen.py
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"

# Load the tokenizer and model; device_map="cuda:0" places the model on the
# CUDA device exposed by tensor-fusion-client.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="cuda:0"
)

# Build a chat prompt with thinking mode enabled.
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Run inference; the CUDA calls are forwarded to the worker on the host.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Split the output into thinking content and the final answer.
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
EOF
python3 test-qwen.py
Validation criteria:
- The VM prints normal model outputs.
- On the host, nvidia-smi shows the corresponding inference process and its GPU memory usage (an example query is shown below).
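For the second check, a compact way to list GPU compute processes on the host is:
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
The process listed should be the tensor-fusion-worker instance serving the VM, since the client only forwards requests and the actual CUDA work runs on the host.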