Decoding a PB-Format File

Date: 2024-12-10

Computational chemistry, file reading, Open Reaction Database

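1. Background

I downloaded chemical reaction data from the Open Reaction Database website. The data include detailed information about each reaction, such as yield, reaction type, and reaction conditions. However, the site provides the data in Protocol Buffers (PB) format rather than in a common structured format such as JSON or CSV.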

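Protocol Buffers is an efficient, flexible binary serialization format developed by Google, typically used to store and exchange structured data. Each field is stored under a numeric tag; the field names live in a separate schema (.proto) file.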
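To make the numeric tags concrete, here is a minimal sketch of how a single field is laid out on the wire. The function names `read_varint` and `read_tag` are my own, not part of any protobuf library, and the three-byte sample is the standard "field 1 = 150" example from the protobuf encoding documentation.

```python
def read_varint(buf, pos=0):
    """Decode one base-128 varint starting at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # the low 7 bits carry data
        if not (byte & 0x80):            # a clear high bit marks the last byte
            return value, pos
        shift += 7


def read_tag(buf, pos=0):
    """Decode a field tag; return (field_number, wire_type, next_pos)."""
    key, pos = read_varint(buf, pos)
    return key >> 3, key & 0x07, pos


# Three bytes that encode "field 1 holds the integer 150".
sample = bytes([0x08, 0x96, 0x01])
field_number, wire_type, pos = read_tag(sample)
value, pos = read_varint(sample, pos)
print(field_number, wire_type, value)  # prints: 1 0 150
```

Only the field number and the wire type are stored in the file, so a reader without the schema can recover values but not names.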

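2. Problem

The website provides only a PB file, with no accompanying schema (the .proto file). Without the schema, the data cannot be read directly with their field names. To decode them, I had to parse the file manually and convert it to a common format such as JSON.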

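3. Solution steps

3.1 Decoding the PB file

Use the protoc tool with --decode_raw, which reads the binary message from standard input and prints it as text keyed by field number, with no schema required: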

protoc --decode_raw < ord_search_results.pb > decoded_output.txt

Example of the decoded content:

1: "ORD Search Results"
3 {
  1 {
    1: 1
    2: "reaction index"
    3: "900"
  }
  ...
}
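Because the decoded text is nested, a small stack-based parser can turn it into Python dictionaries. The sketch below assumes that protoc writes one field, one `N {` opener, or one `}` per line, as in the sample above; `parse_decode_raw` is a hypothetical helper name, and string values keep whatever escaping protoc emitted.

```python
import re


def parse_decode_raw(text):
    """Parse `protoc --decode_raw` text into nested dicts.

    Keys are field numbers as strings. Every value is a list, because the
    same field number may appear more than once. Scalars are kept as text.
    """
    root = {}
    stack = [root]  # stack[-1] is the block currently being filled
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        opener = re.fullmatch(r'(\d+)\s*\{', line)
        scalar = re.fullmatch(r'(\d+):\s*(.+)', line)
        if opener:
            child = {}
            stack[-1].setdefault(opener.group(1), []).append(child)
            stack.append(child)
        elif line == '}':
            if len(stack) == 1:
                raise ValueError("unbalanced closing brace")
            stack.pop()
        elif scalar:
            value = scalar.group(2)
            if len(value) >= 2 and value[0] == value[-1] == '"':
                value = value[1:-1]  # drop the surrounding quotes
            stack[-1].setdefault(scalar.group(1), []).append(value)
        else:
            raise ValueError(f"unrecognised line: {raw_line!r}")
    if len(stack) != 1:
        raise ValueError("unclosed block")
    return root


sample = '''
1: "ORD Search Results"
3 {
  1 {
    1: 1
    2: "reaction index"
    3: "900"
  }
}
'''
parsed = parse_decode_raw(sample)
print(parsed["3"][0]["1"][0]["3"])  # prints: ['900']
```

Keeping every value in a list is a deliberate choice: without the schema there is no way to tell a repeated field from a single one.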

3.2 Obtaining the Full Record data

Example of the structured JSON obtained from the website:

{
  "identifiersList": [
    {
      "type": 1,
      "details": "reaction index",
      "value": "569",
      "isMapped": false
    },
    ...
  ],
  "conditions": {
    "temperature": {
      "setpoint": {
        "value": 100,
        "precision": 10,
        "units": 1
      }
    },
    "reflux": false,
    "ph": 0,
    "conditionsAreDynamic": false
  }
}
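Comparing an identifier block in the decoded text (`1: 1`, `2: "reaction index"`, `3: "900"`) with an entry of `identifiersList` suggests how field numbers line up with names. The mapping below is inferred only from the two samples in this post, not taken from the official schema, and `name_identifier_fields` is a hypothetical helper. `isMapped` does not appear in the decoded sample, presumably because a false default is not serialized, so the sketch fills it in as False.

```python
# Assumed correspondence, inferred by comparing the two samples in this post.
IDENTIFIER_FIELDS = {"1": "type", "2": "details", "3": "value"}


def name_identifier_fields(raw):
    """Rename the numeric keys of one decoded identifier block.

    `raw` maps field numbers (strings) to text values. Unknown field
    numbers are kept under a generic `field_N` name instead of dropped.
    """
    named = {}
    for number, value in raw.items():
        name = IDENTIFIER_FIELDS.get(number, f"field_{number}")
        named[name] = value
    if "type" in named:
        named["type"] = int(named["type"])  # the JSON stores type as a number
    named.setdefault("isMapped", False)     # assumption: absent means False
    return named


print(name_identifier_fields({"1": "1", "2": "reaction index", "3": "900"}))
```

Any mapping built this way should be checked against more than one record before it is trusted.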

3.3 Python processing script

Use the following Python script to convert the text into structured JSON:

import re
import json

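def parse_entry(entry):
    """Collect the quoted string fields of one block into a dict."""
    parsed = {}
    # Matches lines such as  2: "reaction index"; numeric fields like 1: 1 are skipped.
    key_value_pairs = re.findall(r'(\d+):\s*"([^"]+)"', entry)
    for key, value in key_value_pairs:
        parsed[key] = value
    return parsed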

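def parse_nested_structure(text):
    """Split the decoded text at each block opener and parse every piece."""
    result = []
    # Split on the whitespace that precedes an opener such as "3 {".
    blocks = re.split(r'\s(?=\d+\s\{)', text.strip())
    for block in blocks:
        entry = block.strip('{}').strip()
        if entry:
            parsed_entry = parse_entry(entry)
            result.append(parsed_entry)
    return result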

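def convert_to_full_record(decoded_text):
    """Build a Full Record dict from the decoded text.

    The field values are constants copied from the website's Full Record;
    the decoded text only determines which identifiers are included.
    """
    identifiers = []
    # Split before each block opener such as "3 {", never inside a multi-digit number.
    blocks = re.split(r'(?<!\d)(?=\d+\s\{)', decoded_text.strip())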

    for block in blocks:
        if "reaction index" in block:
            identifiers.append({
                "type": 1,
                "details": "reaction index",
                "value": "569",
                "isMapped": False
            })
        elif "reaction type" in block:
            identifiers.append({
                "type": 5,
                "details": "reaction type",
                "value": "1.3.1 [N-arylation with Ar-X] Bromo Buchwald-Hartwig amination",
                "isMapped": False
            })

    full_record = {
        "identifiersList": identifiers,
        "inputsMap": [
            ["Base", {
                "componentsList": [{
                    "identifiersList": [{
                        "type": 2,
                        "details": "",
                        "value": "CC(C)(C)[O-].[Na+]"
                    }],
                    "amount": {
                        "moles": {
                            "value": 0.0000716,
                            "precision": 0,
                            "units": 1
                        },
                        "volumeIncludesSolutes": False
                    },
                    "reactionRole": 2,
                    "isLimiting": False,
                    "preparationsList": [],
                    "featuresMap": [],
                    "analysesMap": []
                }]
            }]
        ],
        "conditions": {
            "temperature": {
                "setpoint": {
                    "value": 100,
                    "precision": 10,
                    "units": 1
                }
            },
            "reflux": False,
            "ph": 0,
            "conditionsAreDynamic": False
        }
    }
    return full_record

# Read the text produced by protoc --decode_raw in step 3.1.
with open("decoded_output.txt", "r", encoding="utf-8") as file:
    decoded_data = file.read()

full_record_data = convert_to_full_record(decoded_data)

with open("structured_full_record.json", "w", encoding="utf-8") as json_file:
    json.dump(full_record_data, json_file, indent=4, ensure_ascii=False)

print("Full record JSON file has been created successfully.")

3.4 Result

The structured JSON file was generated successfully:

{
  "identifiersList": [
    {
      "type": 1,
      "details": "reaction index",
      "value": "569",
      "isMapped": false
    },
    {
      "type": 5,
      "details": "reaction type",
      "value": "1.3.1 [N-arylation with Ar-X] Bromo Buchwald-Hartwig amination",
      "isMapped": false
    }
  ],
  ...
}