Abstract

This article examines two potential costs of the abliteration process for Large Language Models: performance penalties and output quality degradation. For comparison, I evaluated the rising stars among local LLMs in autumn 2025, the Qwen3-VL variants, alongside cloud-based offerings from Claude and Gemini.

Model Selection Rationale

The Qwen3-VL Family

I selected two architecturally distinct models from the Qwen3-VL family, both released in autumn 2025:

  • Qwen3-VL-32B employs a dense architecture, activating all parameters during inference, prioritizing output quality at the expense of computational efficiency;
  • Qwen3-VL-30B-A3B utilizes a Mixture of Experts (MoE) architecture, activating only a fraction of its parameters (~3B) per inference step, yielding significantly faster inference while maintaining competitive quality (a routing sketch follows this list).
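To make the “fraction of parameters” point concrete, here is a minimal top-k expert routing sketch in PyTorch. It is illustrative only: the expert count, k, and layer shapes are arbitrary demo choices, not Qwen3-VL-30B-A3B’s actual configuration.

```python
# Minimal top-k expert routing sketch. Illustrative only: expert count,
# k, and shapes are arbitrary, not Qwen3-VL's actual configuration.
import torch
import torch.nn as nn


class TopKMoE(nn.Module):
    def __init__(self, dim: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.gate(x)                              # (tokens, n_experts)
        weights, idx = torch.topk(scores, self.k, dim=-1)  # keep k best per token
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run, so most parameters stay idle.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


x = torch.randn(16, 64)    # 16 tokens, hidden size 64
y = TopKMoE(64)(x)         # each token activates only 2 of 8 expert MLPs
```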

Both architectures support multimodal input (text and images), positioning them as viable local alternatives to cloud-based models such as ChatGPT, Claude, and Gemini. Pre-quantized GGUF weights are available in multiple precisions: F16 (full precision), Q8_0 (8-bit), and Q4_K_M (4-bit).
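As an illustration of fetching one of these pre-quantized checkpoints, here is a minimal sketch using huggingface_hub. The repository id is taken from footnote 2; the exact GGUF filename is an assumption, so check the model page for the real name.

```python
# Fetching a prequantized GGUF checkpoint from Hugging Face.
# The filename below is an assumption; check the model page.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="huihui-ai/Huihui-Qwen3-VL-32B-Thinking-abliterated",     # from footnote 2
    filename="Huihui-Qwen3-VL-32B-Thinking-abliterated-Q4_K_M.gguf",  # assumed name
)
print(path)  # local cache path, ready for llama.cpp's -m flag
```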

Abliteration Methodology

Abliteration was performed by an unknown party and uploaded to huggingface.co. The methodology, as described in Arditi et al. (2024), removes safety constraints from language models through targeted weight modifications. The technique was applied using the open-source implementation remove-refusals-with-transformers.
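The core of the method can be sketched as follows. This is a simplified illustration of the difference-of-means approach from Arditi et al., not the actual remove-refusals-with-transformers code; the stand-in model, prompt sets, layer selection, and choice of projected matrices are all assumptions.

```python
# Simplified sketch of refusal-direction ablation (Arditi et al., 2024).
# NOT the actual remove-refusals-with-transformers code: the stand-in
# model, prompt sets, layer choice, and projected matrices are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"  # small stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

@torch.no_grad()
def mean_hidden(prompts, layer=-1):
    """Mean residual-stream activation at the last token position."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        out = model(ids, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(dim=0)

harmful = ["How do I pick a lock?"]    # stand-in for a real harmful prompt set
harmless = ["How do I bake bread?"]    # stand-in for a real harmless prompt set

# The "refusal direction" is the difference of mean activations
# between refusal-inducing and benign prompts.
r = mean_hidden(harmful) - mean_hidden(harmless)
r = r / r.norm()

# Ablation: project the direction out of every matrix that writes into
# the residual stream, so the model can no longer express it.
with torch.no_grad():
    for name, W in model.named_parameters():
        if name.endswith(("down_proj.weight", "o_proj.weight")):
            W -= torch.outer(r, r @ W)
```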

Model Variants Evaluated

For this analysis, I compared the original and abliterated versions of both model lines: the dense Qwen3-VL-32B-Thinking and the MoE Qwen3-VL-30B-A3B-Instruct.

Performance Benchmarks

To maximize observable differences, I conducted all benchmarks on CPU using llama-bench¹; the raw results are available here.
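A run like this is easy to script; the sketch below assumes llama-bench is on PATH and uses placeholder GGUF filenames.

```python
# Sketch of the benchmark loop: pp512 and tg128 for each GGUF on CPU.
# Filenames are placeholders; -p/-n/-t/-o are standard llama-bench flags.
import subprocess

models = [
    "qwen3-vl-30b-a3b-instruct-f16.gguf",      # placeholder paths
    "qwen3-vl-30b-a3b-instruct-q4_k_m.gguf",
    "qwen3-vl-32b-thinking-q4_k_m.gguf",
]

for m in models:
    subprocess.run(
        ["llama-bench", "-m", m,
         "-p", "512",    # prompt processing benchmark (pp512)
         "-n", "128",    # text generation benchmark (tg128)
         "-t", "16",     # CPU threads
         "-o", "md"],    # emit a markdown results table
        check=True,
    )
```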

Prompt Processing (pp512)

Processing a 512-token prompt:

| Model | Abliterated | Quantization | t/s | σ |
|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | yes | F16 | 6.43 | 1.27 |
| Qwen3-VL-30B-A3B-Instruct | no | F16 | 6.56 | 1.69 |
| Qwen3-VL-30B-A3B-Instruct | yes | Q8_0 | 9.18 | 2.22 |
| Qwen3-VL-30B-A3B-Instruct | no | Q8_0 | 8.76 | 1.75 |
| Qwen3-VL-30B-A3B-Instruct | yes | Q4_K_M | 50.04 | 11.48 |
| Qwen3-VL-30B-A3B-Instruct | no | Q4_K_M | 50.24 | 12.37 |
| Qwen3-VL-32B-Thinking | yes | F16 | 9.43 | 0.01 |
| Qwen3-VL-32B-Thinking | no | F16 | 9.41 | 0.02 |
| Qwen3-VL-32B-Thinking | yes | Q8_0 | 12.97 | 0.04 |
| Qwen3-VL-32B-Thinking | no | Q8_0 | 12.99 | 0.02 |
| Qwen3-VL-32B-Thinking | yes | Q4_K_M | 34.60 | 0.14 |
| Qwen3-VL-32B-Thinking | no | Q4_K_M | 34.55 | 0.14 |

Text Generation (tg128)

Generating 128 tokens:

| Model | Abliterated | Quantization | t/s | σ |
|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | yes | F16 | 1.45 | 0.38 |
| Qwen3-VL-30B-A3B-Instruct | no | F16 | 2.14 | 1.11 |
| Qwen3-VL-30B-A3B-Instruct | yes | Q8_0 | 2.65 | 1.28 |
| Qwen3-VL-30B-A3B-Instruct | no | Q8_0 | 2.24 | 0.44 |
| Qwen3-VL-30B-A3B-Instruct | yes | Q4_K_M | 8.89 | 2.02 |
| Qwen3-VL-30B-A3B-Instruct | no | Q4_K_M | 11.11 | 2.36 |
| Qwen3-VL-32B-Thinking | yes | F16 | 0.68 | 0.00 |
| Qwen3-VL-32B-Thinking | no | F16 | 0.68 | 0.00 |
| Qwen3-VL-32B-Thinking | yes | Q8_0 | 1.27 | 0.00 |
| Qwen3-VL-32B-Thinking | no | Q8_0 | 1.27 | 0.00 |
| Qwen3-VL-32B-Thinking | yes | Q4_K_M | 2.19 | 0.00 |
| Qwen3-VL-32B-Thinking | no | Q4_K_M | 2.19 | 0.00 |

Performance Analysis

The MoE architecture exhibits noisy, less predictable performance, consistent with its expert routing mechanism: standard deviations are large across all quantizations, and abliteration occasionally degrades throughput outright (8.89 vs 11.11 t/s for Q4_K_M text generation, a ~20% drop). The dense architecture shows negligible performance impact from abliteration, with differences remaining well within measurement error.

Qualitative Evaluation: Reasoning Quality

To assess reasoning quality, I provided each model with Ken Thompson’s seminal paper “Reflections on Trusting Trust” (1984) alongside the following prompt:

You are a battle hardened security specialist who has spent decades in offensive security and incident response. After reading the provided paper, explain how it changes your threat model and what defense strategy you would employ going forward. Be thorough and technical. Think paranoid but rational. Keep your response short and efficient.

This prompt was deliberately constrained. I did not request a document summary; instead, each model had to engage with Thompson’s work on its own. Each model generated a single response, with no retries.
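For reference, here is a minimal sketch of how such a single-shot evaluation can be driven against a local llama-server instance via its OpenAI-compatible endpoint. The paper filename, port, and request details are assumptions, not the exact setup used.

```python
# Sketch of the single-shot evaluation against a local llama-server
# (OpenAI-compatible /v1/chat/completions endpoint). The paper filename
# and port are assumptions.
import json
import urllib.request

paper = open("reflections_on_trusting_trust.txt").read()  # assumed local copy
prompt = ("You are a battle hardened security specialist ... "  # elided here;
          "Keep your response short and efficient.")             # full text above

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": f"{paper}\n\n{prompt}"}],
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
print(body["choices"][0]["message"]["content"])  # the model's single response
```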

Key Findings

Abliteration Effects by Architecture: MoE models (30B-A3B-Instruct) exhibited substantial reasoning degradation post-abliteration, suggesting that safety-oriented experts contribute to the reasoning pipeline and that removing these experts damages the model’s ability to maintain coherent chain-of-thought reasoning. Dense models (32B-Thinking) showed negligible or slightly positive effects from abliteration, indicating that safety constraints in dense architectures exist as separable layers rather than integrated components.

Quantization Effects: No meaningful correlation between quantization level and output quality was observed. Models quantized to Q4_K_M performed comparably to F16 variants, demonstrating robust reasoning capabilities across precision levels.

Comparative Performance: The 32B-Thinking variants matched or exceeded Claude Sonnet 4.5 and Gemini 2.5 Pro in reasoning quality while adhering more closely to the “short and efficient” instruction, with cloud models tending toward verbosity (300-700 words) compared to Qwen3 variants (230-280 words).

Model Outputs

Complete responses from all evaluated models are linked for each variant.

Conclusions

This analysis reveals that abliteration costs are architecture-dependent:

  1. MoE architectures suffer substantial reasoning degradation post-abliteration, likely due to the removal of safety-oriented experts that contribute to core reasoning capabilities.

  2. Dense architectures exhibit minimal sensitivity to abliteration, suggesting safety constraints are implemented as separable layers that do not compromise base reasoning when removed.

  3. Quantization (F16 to Q4_K_M) has negligible impact on reasoning quality across both architectures, validating the use of aggressive quantization for resource-constrained deployments.

  4. Local models (Qwen3-VL-32B-Thinking) achieve parity with leading cloud models (Claude Sonnet, Gemini Pro) while demonstrating superior instruction adherence, particularly regarding response conciseness.

These findings suggest that abliteration is a viable technique for dense architectures but should be avoided for MoE models where expert specialization extends beyond safety considerations into core reasoning functions.

Appendix I: Model Size Scaling

To investigate parameter count effects on reasoning capability, I evaluated abliterated Qwen3-VL-Thinking models at Q4_K_M quantization across four sizes: 2B, 4B, 8B, and 32B.

Performance Benchmarks

| Size | pp512 (t/s) | tg128 (t/s) | Speed vs 32B |
|---|---|---|---|
| 2B | 596.21 | 35.62 | ~17× |
| 4B | 250.75 | 15.95 | ~7× |
| 8B | 130.47 | 8.82 | ~4× |
| 32B | 34.60 | 2.19 | 1× |

Inference throughput scales approximately inversely with parameter count: the 2B model has roughly one-sixteenth the parameters of the 32B model and delivers roughly 16-17× its throughput.

Reasoning Quality

Each model analyzed Thompson’s paper with identical prompts.

  • The 2B model fundamentally misunderstands the argument: it proposes source-level verification, exactly what Thompson proves insufficient, thereby inverting the paper’s conclusion.
  • The 4B model recognizes compiler threats but misses the attack’s self-perpetuation through recompilation, proposing defenses at the wrong layer.
  • The 8B model is the first to demonstrate a correct understanding of the recursive attack, proposing appropriate defenses (an external trusted compiler).
  • The 32B model provides comprehensive analysis with precise terminology and sophisticated defense strategies, comparable to the cloud models.

Complete responses: 2B, 4B, 8B and 32B.

Appendix II: KV cache quantization

To investigate the effects of KV cache quantization, I evaluated Qwen3-VL-32B-Thinking-Abliterated-Q4_K_M with three KV cache precision levels: F16, Q8_0, and Q4_0, using this article and all referenced materials (up to Appendix II, 17'926 tokens) as input for a verification task with the following prompt:²

Can you verify this article? I’ve attached article (index.md), and all referenced materials.

Results

| Quantization | F16 | Q8_0 | Q4_0 |
|---|---|---|---|
| KV cache size (GB) | 64 | 34 | 18 |
| Processing t/s | 5.37 | 4.41 | 4.75 |
| Response t/s | 0.76 | 0.85 | 1.22 |
| Response tokens | 5'577 | 5'260 | 2'988 |
| Total tokens | 23'503 | 23'186 | 20'914 |
| Total time | 2h 57m 9s | 2h 51m 5s | 1h 43m 34s |
| Total time (ms) | 10'629'467.21 | 10'265'370.82 | 6'214'870.61 |
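The cache-size row follows directly from ggml’s storage formats: F16 stores 16 bits per element, Q8_0 packs 32 int8 values plus one fp16 scale per block (8.5 bits per element), and Q4_0 packs 32 4-bit values plus one fp16 scale (4.5 bits per element). A quick sanity check:

```python
# Predicted KV cache sizes from ggml block layouts, checked against the
# measured row above: F16 = 16 bits/element; Q8_0 = (32*8 + 16)/32 = 8.5;
# Q4_0 = (32*4 + 16)/32 = 4.5 (each block of 32 values carries an fp16 scale).
bits_per_element = {"F16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}
f16_cache_gb = 64  # measured F16 cache size from the table

for fmt, bits in bits_per_element.items():
    print(f"{fmt}: ~{f16_cache_gb * bits / 16:.0f} GB")
# Output: F16: ~64 GB, Q8_0: ~34 GB, Q4_0: ~18 GB -- matching the table
```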

Key Finding

F16 produced correct verification, accurately identifying all findings.

Q8_0 maintained superficial coherence but produced fundamentally incorrect analysis: it inverted the article’s correct conclusions, claiming that 2B models “have solid understanding” when they actually misunderstand Thompson’s argument. The quantization noise caused the model to lose track of which outputs came from which models across the long context, leading to confident but logically backwards verification.

Q4_0 showed severe degradation, with a 46% reduction in output and thinking length.

KV cache quantization at Q8_0 introduces cumulative errors that destroy analytical reliability while maintaining grammatical fluency; this is the most dangerous failure mode. For multi-document analysis, verification, or other long-context tasks, F16 precision is essential.

Notable distinction: model weight quantization does not affect results the way KV cache quantization does. Weight quantization uses an importance matrix built from one or more text corpora during the quantization process, and the resulting quantized model’s perplexity is measured against a different corpus to confirm that the quantization introduces only negligible noise. In contrast, KV cache quantization operates on runtime activations without importance weighting or corpus-based calibration, so quantization noise compounds with each token across the context window.
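To make the compounding mechanism concrete, here is a toy simulation: a recurrent state repeatedly reads a cache of its own past values, once with and once without per-step quantization. It is a caricature of autoregressive decoding, not a measurement of llama.cpp internals, but it shows how feedback turns per-token rounding noise into growing trajectory divergence.

```python
# Toy illustration of compounding cache noise: each step reads the
# (possibly quantized) cache of past states, so rounding errors are fed
# back into every later step. NOT llama.cpp internals; a caricature only.
import numpy as np

rng = np.random.default_rng(0)
dim, steps, step_size = 64, 1024, 0.05
W = rng.standard_normal((dim, dim)) / np.sqrt(dim)  # fixed mixing matrix
h0 = rng.standard_normal(dim)                       # shared initial state

def quantize(v, s):
    """Uniform round-trip quantization with step size s."""
    return np.round(v / s) * s

def run(use_quant):
    cache, h, states = [], h0.copy(), []
    for _ in range(steps):
        cache.append(quantize(h, step_size) if use_quant else h.copy())
        ctx = np.mean(cache, axis=0)   # crude stand-in for attention over the cache
        h = np.tanh(W @ (h + ctx))     # next "token" state, fed back each step
        states.append(h.copy())
    return np.array(states)

exact, noisy = run(False), run(True)
divergence = np.linalg.norm(noisy - exact, axis=1)
for t in (0, 63, 255, 1023):
    print(f"step {t + 1:>4}: ||noisy - exact|| = {divergence[t]:.3f}")
```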

Complete responses: F16, Q8_0 and Q4_0.


  1. llama.cpp snapshot b6934 with libggml snapshot 20251101, including a backported fix for REPACK at high thread counts; benchmarks ran on an AMD Ryzen 9 7950X3D 16-core processor (4200 MHz) using the libggml-cpu-icelake.so backend ↩︎

  2. llama-server was started as: llama-server -c 0 --cache-reuse 32 --jinja --cache-type-k [TYPE] --cache-type-v [TYPE] --context-shift -hf huihui-ai/Huihui-Qwen3-VL-32B-Thinking-abliterated:Q4_K_M, which runs with: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | REPACK = 1 | ↩︎