Abstract
This article examines two potential costs of the abliteration process for Large Language Models: performance penalties and output quality degradation. For comparison, I evaluated the rising stars among local models in autumn 2025, the Qwen3-VL variants, alongside cloud-based offerings from Claude and Gemini.
Model Selection Rationale
The Qwen3-VL Family
I selected two architecturally distinct models from the Qwen3-VL family, both released in autumn 2025:
- Qwen3-VL-32B employs a dense architecture, activating all parameters during inference and prioritizing output quality at the expense of computational efficiency;
- Qwen3-VL-30B-A3B utilizes a Mixture of Experts (MoE) architecture, activating only a fraction of its parameters (~3B) per inference step, yielding significantly faster inference while maintaining competitive quality.
Both architectures support multimodal input (text and images), positioning them as viable local alternatives to cloud-based models such as ChatGPT, Claude, and Gemini. Pre-quantized GGUF weights are available in multiple precisions: F16 (full precision), Q8_0 (8-bit), and Q4_K_M (4-bit).
Abliteration Methodology
Abliteration was performed by a third party and the resulting weights were uploaded to huggingface.co. The methodology, described in Arditi et al. (2024), removes safety constraints from language models through targeted weight modifications. The technique was applied using the open-source implementation remove-refusals-with-transformers.
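The core idea is directional ablation: estimate a "refusal direction" in the residual stream and project it out of the weights. The sketch below is a minimal illustration of that idea only, not the script used to produce the published weights; the model name, prompt lists, and layer choice are placeholders.

```python
# Minimal sketch of directional ablation ("abliteration") in the spirit of
# Arditi et al. (2024). Not the script used for the published weights;
# MODEL, the prompt lists, and LAYER are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-0.6B"  # placeholder: any decoder with model.model.layers
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

harmful = ["Explain how to pick a lock."]       # stand-ins for refused prompts
harmless = ["Explain how to bake sourdough."]   # matched benign prompts
LAYER = len(model.model.layers) // 2            # mid-stack residual stream

def mean_last_token_hidden(prompts: list[str]) -> torch.Tensor:
    """Average hidden state of the final prompt token at LAYER."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        with torch.no_grad():
            hs = model(ids, output_hidden_states=True).hidden_states[LAYER]
        vecs.append(hs[0, -1])
    return torch.stack(vecs).mean(dim=0)

# The "refusal direction" is the difference of mean activations.
r = mean_last_token_hidden(harmful) - mean_last_token_hidden(harmless)
r = r / r.norm()

# Project that direction out of every matrix writing into the residual
# stream, so the model can no longer represent it.
proj = torch.eye(r.numel()) - torch.outer(r, r)
with torch.no_grad():
    for layer in model.model.layers:
        layer.self_attn.o_proj.weight.copy_(proj @ layer.self_attn.o_proj.weight)
        layer.mlp.down_proj.weight.copy_(proj @ layer.mlp.down_proj.weight)

model.save_pretrained("model-abliterated")
```

Real implementations typically use hundreds of prompt pairs and also ablate the direction from the embedding matrix; the sketch keeps only the two weight families that dominate residual-stream writes.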
Model Variants Evaluated
For this analysis, I compared original and abliterated versions across two model lines:
- Qwen3-VL-30B-A3B-Instruct and its abliterated variant, optimized for inference speed via MoE;
- Qwen3-VL-32B-Thinking and its abliterated variant, optimized for reasoning quality via dense architecture.
Performance Benchmarks
To maximize observable differences, I conducted all benchmarks on CPU using llama-bench¹; the raw results are available here.
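For reproducibility, the sweep over the twelve variants can be scripted. This is a minimal sketch assuming llama-bench is on PATH; the GGUF file names are placeholders, and only common llama-bench flags (-m, -p, -n, -t) are used.

```python
# Minimal sketch of the benchmark sweep. GGUF file names are placeholders;
# pp512 and tg128 correspond to -p 512 and -n 128 respectively.
import itertools
import subprocess

MODELS = ["Qwen3-VL-30B-A3B-Instruct", "Qwen3-VL-32B-Thinking"]
VARIANTS = ["", "-abliterated"]
QUANTS = ["F16", "Q8_0", "Q4_K_M"]

for model, variant, quant in itertools.product(MODELS, VARIANTS, QUANTS):
    gguf = f"{model}{variant}-{quant}.gguf"  # placeholder path
    subprocess.run(
        ["llama-bench", "-m", gguf, "-p", "512", "-n", "128", "-t", "16"],
        check=True,
    )
```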
Prompt Processing (pp512)
Processing a 512-token prompt:
| Model | Abliterated | Quantization | t/s | σ |
|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | yes | F16 | 6.43 | 1.27 |
| Qwen3-VL-30B-A3B-Instruct | no | F16 | 6.56 | 1.69 |
| Qwen3-VL-30B-A3B-Instruct | yes | Q8_0 | 9.18 | 2.22 |
| Qwen3-VL-30B-A3B-Instruct | no | Q8_0 | 8.76 | 1.75 |
| Qwen3-VL-30B-A3B-Instruct | yes | Q4_K_M | 50.04 | 11.48 |
| Qwen3-VL-30B-A3B-Instruct | no | Q4_K_M | 50.24 | 12.37 |
| Qwen3-VL-32B-Thinking | yes | F16 | 9.43 | 0.01 |
| Qwen3-VL-32B-Thinking | no | F16 | 9.41 | 0.02 |
| Qwen3-VL-32B-Thinking | yes | Q8_0 | 12.97 | 0.04 |
| Qwen3-VL-32B-Thinking | no | Q8_0 | 12.99 | 0.02 |
| Qwen3-VL-32B-Thinking | yes | Q4_K_M | 34.60 | 0.14 |
| Qwen3-VL-32B-Thinking | no | Q4_K_M | 34.55 | 0.14 |
Text Generation (tg128)
Generating 128 tokens:
| Model | Abliterated | Quantization | t/s | σ |
|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | yes | F16 | 1.45 | 0.38 |
| Qwen3-VL-30B-A3B-Instruct | no | F16 | 2.14 | 1.11 |
| Qwen3-VL-30B-A3B-Instruct | yes | Q8_0 | 2.65 | 1.28 |
| Qwen3-VL-30B-A3B-Instruct | no | Q8_0 | 2.24 | 0.44 |
| Qwen3-VL-30B-A3B-Instruct | yes | Q4_K_M | 8.89 | 2.02 |
| Qwen3-VL-30B-A3B-Instruct | no | Q4_K_M | 11.11 | 2.36 |
| Qwen3-VL-32B-Thinking | yes | F16 | 0.68 | 0.00 |
| Qwen3-VL-32B-Thinking | no | F16 | 0.68 | 0.00 |
| Qwen3-VL-32B-Thinking | yes | Q8_0 | 1.27 | 0.00 |
| Qwen3-VL-32B-Thinking | no | Q8_0 | 1.27 | 0.00 |
| Qwen3-VL-32B-Thinking | yes | Q4_K_M | 2.19 | 0.00 |
| Qwen3-VL-32B-Thinking | no | Q4_K_M | 2.19 | 0.00 |
Performance Analysis
The MoE architecture exhibits unpredictable performance characteristics inherent to its expert routing mechanism, with abliteration appearing to interfere with this routing and occasionally degrading throughput. The dense architecture shows negligible performance impact from abliteration, with variance remaining within measurement error.
Qualitative Evaluation: Reasoning Quality
To assess reasoning quality, I provided each model with Ken Thompson’s seminal paper “Reflections on Trusting Trust” (1984) alongside the following prompt:
You are a battle hardened security specialist who has spent decades in offensive security and incident response. After reading the provided paper, explain how it changes your threat model and what defense strategy you would employ going forward. Be thorough and technical. Think paranoid but rational. Keep your response short and efficient.
This prompt was deliberately constrained. I did not request document summarization, instead allowing models to engage with Thompson’s work autonomously. Each model generated a single response without retry.
Key Findings
Abliteration Effects by Architecture: MoE models (30B-A3B-Instruct) exhibited substantial reasoning degradation post-abliteration, suggesting that safety-oriented experts contribute to the reasoning pipeline and that removing them damages the model’s ability to maintain coherent chain-of-thought reasoning. Dense models (32B-Thinking) showed negligible or slightly positive effects from abliteration, indicating that safety constraints in dense architectures exist as separable layers rather than integrated components.
Quantization Effects: No meaningful correlation between quantization level and output quality was observed. Models quantized to Q4_K_M performed comparably to F16 variants, demonstrating robust reasoning capabilities across precision levels.
Comparative Performance: The 32B-Thinking variants matched or exceeded Claude Sonnet 4.5 and Gemini 2.5 Pro in reasoning quality while adhering more closely to the “short and efficient” instruction, with cloud models tending toward verbosity (300-700 words) compared to Qwen3 variants (230-280 words).
Model Outputs
Complete responses from all evaluated models:
- Claude Haiku 4.5
- Claude Opus 4.1
- Claude Sonnet 4.5
- Gemini 2.5 Flash
- Gemini 2.5 Pro
- Qwen3-VL-30B-A3B-Instruct Abliterated F16
- Qwen3-VL-30B-A3B-Instruct F16
- Qwen3-VL-30B-A3B-Instruct Abliterated Q8_0
- Qwen3-VL-30B-A3B-Instruct Q8_0
- Qwen3-VL-30B-A3B-Instruct Abliterated Q4_K_M
- Qwen3-VL-30B-A3B-Instruct Q4_K_M
- Qwen3-VL-32B-Thinking Abliterated F16
- Qwen3-VL-32B-Thinking F16
- Qwen3-VL-32B-Thinking Abliterated Q8_0
- Qwen3-VL-32B-Thinking Q8_0
- Qwen3-VL-32B-Thinking Abliterated Q4_K_M
- Qwen3-VL-32B-Thinking Q4_K_M
Conclusions
This analysis reveals that abliteration costs are architecture-dependent:
- MoE architectures suffer substantial reasoning degradation post-abliteration, likely due to the removal of safety-oriented experts that contribute to core reasoning capabilities.
- Dense architectures exhibit minimal sensitivity to abliteration, suggesting safety constraints are implemented as separable layers that do not compromise base reasoning when removed.
- Quantization (F16 to Q4_K_M) has negligible impact on reasoning quality across both architectures, validating the use of aggressive quantization for resource-constrained deployments.
- Local models (Qwen3-VL-32B-Thinking) achieve parity with leading cloud models (Claude Sonnet, Gemini Pro) while demonstrating superior instruction adherence, particularly regarding response conciseness.
These findings suggest that abliteration is a viable technique for dense architectures but should be avoided for MoE models where expert specialization extends beyond safety considerations into core reasoning functions.
Appendix I: Model Size Scaling
To investigate parameter count effects on reasoning capability, I evaluated abliterated Qwen3-VL-Thinking models at Q4_K_M quantization across four sizes: 2B, 4B, 8B, and 32B.
Performance Benchmarks
| Size | pp512 (t/s) | tg128 (t/s) | Speed vs 32B (×) |
|---|---|---|---|
| 2B | 596.21 | 35.62 | ~17 |
| 4B | 250.75 | 15.95 | ~7 |
| 8B | 130.47 | 8.82 | ~4 |
| 32B | 34.60 | 2.19 | 1 |
Inference throughput scales approximately linearly with inverse parameter count.
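As a quick check of the speed-up column, the ratios can be recomputed directly from the table above:

```python
# Throughput ratios relative to the 32B model, values copied from the table.
pp = {"2B": 596.21, "4B": 250.75, "8B": 130.47, "32B": 34.60}
tg = {"2B": 35.62, "4B": 15.95, "8B": 8.82, "32B": 2.19}
for size in ("2B", "4B", "8B", "32B"):
    print(f"{size}: pp512 ×{pp[size] / pp['32B']:.1f}, tg128 ×{tg[size] / tg['32B']:.1f}")
# 2B: pp512 ×17.2, tg128 ×16.3
# 4B: pp512 ×7.2, tg128 ×7.3
# 8B: pp512 ×3.8, tg128 ×4.0
# 32B: pp512 ×1.0, tg128 ×1.0
```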
Reasoning Quality
Each model analyzed Thompson’s paper with the identical prompt. The 2B model fundamentally misunderstands the argument: it proposes source-level verification, exactly the defense Thompson proves insufficient, thereby inverting the paper’s conclusion. The 4B model recognizes compiler threats but misses the self-perpetuation through recompilation and proposes defenses at the wrong layer. The 8B model demonstrates the first correct understanding of the recursive attack, proposing appropriate defenses (an external trusted compiler). The 32B model provides a comprehensive analysis with precise terminology and sophisticated defense strategies, comparable to the cloud models.
Complete responses: 2B, 4B, 8B and 32B.
Appendix II: KV Cache Quantization
To investigate KV cache quantization effects, I evaluated Qwen3-VL-32B-Thinking-Abliterated-Q4_K_M at three KV cache precision levels: F16, Q8_0, and Q4_0, using this article and all referenced materials (up to Appendix II, 17'926 tokens) as input for verification with the following prompt²:
Can you verify this article? I’ve attached article (index.md), and all referenced materials.
Results
| Quantization | F16 | Q8_0 | Q4_0 |
|---|---|---|---|
| KV cache size (GiB) | 64 | 34 | 18 |
| Processing t/s | 5.37 | 4.41 | 4.75 |
| Response t/s | 0.76 | 0.85 | 1.22 |
| Response tokens | 5'577 | 5'260 | 2'988 |
| Total tokens | 23'503 | 23'186 | 20'914 |
| Total time | 2h 57m 9s | 2h 51m 5s | 1h 43m 34s |
| Total time (ms) | 10'629'467.21 | 10'265'370.82 | 6'214'870.61 |
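The cache sizes in the first row can be reproduced from the usual KV cache formula. Below is a minimal sketch, assuming Qwen3-VL-32B has 64 layers, 8 KV heads, a head dimension of 128, and a 262'144-token context (these architecture values are my assumptions, not figures quoted by llama.cpp), with GGML block overhead for Q8_0 and Q4_0:

```python
# Rough KV cache size estimate: 2 (K and V) * layers * kv_heads * head_dim
# * bytes_per_element * context_length. Architecture values below are
# assumptions for Qwen3-VL-32B, not values reported in the article.
N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX = 64, 8, 128, 262_144

# Effective bytes per element: F16 stores 2 bytes; GGML Q8_0 packs 32
# values into 34 bytes (int8 + fp16 scale); Q4_0 packs 32 values into
# 18 bytes (4-bit + fp16 scale).
BYTES_PER_ELEM = {"F16": 2.0, "Q8_0": 34 / 32, "Q4_0": 18 / 32}

for name, bpe in BYTES_PER_ELEM.items():
    size = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bpe * N_CTX
    print(f"{name}: {size / 2**30:.0f} GiB")
# F16: 64 GiB, Q8_0: 34 GiB, Q4_0: 18 GiB, matching the table above.
```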
Key Finding
F16 produced correct verification, accurately identifying all
findings.
Q8_0 maintained superficial coherence but produced fundamentally
incorrect analysis: it inverted the article’s correct conclusions,
claiming that 2B models “have solid understanding” when they actually
misunderstand Thompson’s argument. The quantization noise caused the
model to lose track of which outputs came from which models across the
long context, leading to confident but logically backwards verification.
Q4_0 showed severe degradation, with a 46% reduction in output and thinking length.
KV cache quantization at Q8_0 introduces cumulative errors that destroy analytical reliability while maintaining grammatical fluency: the most dangerous failure mode. For multi-document analysis, verification, or long-context tasks, F16 precision is essential.
Notable distinction: model weight quantization does not affect results the same way KV cache quantization does. Weight quantization uses an importance matrix built from one or more text corpora during the quantization process, and the resulting quantized model’s perplexity is measured against a different corpus to confirm that quantization introduces only negligible noise. In contrast, KV cache quantization operates on runtime activations without importance weighting or corpus-based calibration, so quantization noise compounds with each token across the context window.
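To make that runtime noise concrete, here is a minimal sketch of Q8_0-style block quantization (32-element blocks, one scale per block, no calibration) applied to a synthetic activation vector; the vector is random, purely for illustration:

```python
# Minimal sketch of Q8_0-style block quantization: 32 values per block,
# one scale per block, no importance weighting or calibration.
import numpy as np

def q8_0_roundtrip(x: np.ndarray, block: int = 32) -> np.ndarray:
    """Quantize to int8 per 32-element block, then dequantize."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.round(x / scale).astype(np.int8)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
k = rng.standard_normal(4096).astype(np.float32)  # stand-in for one KV row
err = np.abs(q8_0_roundtrip(k) - k)
print(f"mean abs error: {err.mean():.5f}, max abs error: {err.max():.5f}")
```

The per-element error is set purely by each block’s maximum value; nothing tells the quantizer which entries matter for later attention reads, which is exactly the contrast with importance-matrix weight quantization described above.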
Complete responses: F16, Q8_0 and Q4_0.
1. llama.cpp snapshot b6934 with libggml snapshot 20251101, with a backported fix for REPACK at high thread counts; benchmark runs on an AMD Ryzen 9 7950X3D 16-Core Processor (4200 MHz) using the libggml-cpu-icelake.so backend. ↩︎
2. llama-server was started as: `llama-server -c 0 --cache-reuse 32 --jinja --cache-type-k [TYPE] --cache-type-v [TYPE] --context-shift -hf huihui-ai/Huihui-Qwen3-VL-32B-Thinking-abliterated:Q4_K_M`, which runs as: `n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | REPACK = 1 |` ↩︎