jtar处理字符

前置知识

Unicode

UCS-2：这种形式使用2个字节来表示一个字符，最多可以表示65536个字符

UTF-8

UTF-8是一种变长的编码方式，它遵循Unicode规范。UTF-8编码的字符长度不固定，常用的中文字符占3个字节，不常用的中文字符占4个字节1。

16进制

Unicode就是16进制

Unicode、UTF-8、UTF-16 终于懂了 - 知乎

D3CTF-2025-Official-Writeup/D3CTF-2025-Official-Writeup-CN.pdf at main · D-3CTF/D3CTF-2025-Official-Writeup · GitHub

这题如果直接上传jsp会被拦截，所以要寻找绕过的方式。

jtar中的org.kamranzafar.jtar.TarHeader#getNameBytes错误的使用了byte进行强制类型转换。Java中的byte是8位二进制。所以只会截取两个字符组成的Unicode的低位。

Java byte类型详解：取值范围与运算-CSDN博客

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) {
        // 创建一个包含Unicode字符的文件名
        String name = "test.jsŰ";

        // 错误方式：使用(byte)强制转换每个字符
        byte[] buf = new byte[name.length()];
        for (int i = 0; i < name.length(); i++) {
            // 强制转换为byte，会导致高位截断
            buf[i] = (byte) name.charAt(i);
        }
        String forcedFileName = new String(buf, StandardCharsets.ISO_8859_1);
        System.out.println("\n错误方式 - 使用(byte)强制转换:");
        System.out.println("原始文件名: " + name);
        System.out.println("转换后文件名: " + forcedFileName);
        System.out.println("是否匹配: " + name.equals(forcedFileName));
    }
}

用python脚本来寻找适合的字符

# 目标字节值（对应 ASCII 字符 'p'）
target_byte_value = 0x70

# 存储匹配的字符及其 Unicode 编码
matching_chars = []

# 遍历所有基本多文种平面(BMP)的 Unicode 码点 (0x0000-0xFFFF)
for code_point in range(0x10000):
    # 检查码点的低字节是否等于目标字节值
    if (code_point & 0xFF) == target_byte_value:
        try:
            # 将码点转换为对应的 Unicode 字符
            char = chr(code_point)
            # 存储字符及其 Unicode 编码（十六进制格式）
            matching_chars.append((char, code_point))
        except ValueError:
            # 处理可能的无效码点（理论上不会发生在 BMP 范围内）
            continue

# 打印匹配结果
print(f"找到 {len(matching_chars)} 个低字节为 0x{target_byte_value:02X} 的 Unicode 字符：")
for char, code_point in matching_chars:
    # 获取码点的16位二进制表示
    binary_str = format(code_point, '016b')
    # 将二进制字符串分为高8位和低8位
    high_byte_bin = binary_str[:8]
    low_byte_bin = binary_str[8:]
    
    print(f"字符: {char} | Unicode: {hex(code_point)} | "
          f"二进制: {binary_str} ({high_byte_bin} {low_byte_bin})")