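Date: 2024-12-10
I downloaded chemical reaction data from the Open Reaction Database website. The data includes details of each reaction, such as yield, reaction type, and reaction conditions. However, the website provides the data in Protocol Buffers (PB) format rather than a common structured format such as JSON or CSV.
PB is an efficient, flexible binary serialization format developed by Google, typically used to store structured data.
The website provides only a single PB file, with no accompanying schema (the .proto file). Without the schema, the file can only be read as numbered fields with no names. To decode it, I need to parse it manually and convert it to a common format such as JSON.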
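To give a feel for what parsing without a schema involves, here is a minimal sketch of reading the protobuf wire format by hand. Each field starts with a varint key: the low three bits are the wire type and the remaining bits are the field number. The names `read_varint` and `read_field` and the sample bytes are my own illustration, not part of any ORD tooling, and the deprecated group wire types (3 and 4) are not handled.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:               # high bit clear: last byte
            return result, pos
        shift += 7


def read_field(buf, pos):
    """Decode one field; return (field_number, wire_type, value, next_pos)."""
    key, pos = read_varint(buf, pos)
    field_number, wire_type = key >> 3, key & 0x07
    if wire_type == 0:        # varint (ints, bools, enums)
        value, pos = read_varint(buf, pos)
    elif wire_type == 1:      # fixed 64-bit, returned as raw bytes
        value, pos = buf[pos:pos + 8], pos + 8
    elif wire_type == 2:      # length-delimited: string, bytes, or nested message
        length, pos = read_varint(buf, pos)
        value, pos = buf[pos:pos + length], pos + length
    elif wire_type == 5:      # fixed 32-bit, returned as raw bytes
        value, pos = buf[pos:pos + 4], pos + 4
    else:
        raise ValueError(f"unsupported wire type {wire_type}")
    return field_number, wire_type, value, pos


# Hand-built sample (hypothetical): field 1 = "ORD", field 2 = varint 150
sample = b"\x0a\x03ORD" + b"\x10\x96\x01"
first = read_field(sample, 0)
second = read_field(sample, first[3])
print(first, second)
```

Note that wire type 2 does not say whether the payload is a string or a nested message. Without the .proto file, a decoder has to guess, which is also what `protoc --decode_raw` does.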
Use the protoc tool to decode the PB file into text format:
protoc --decode_raw < ord_search_results.pb > decoded_output.txt
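The same step can also be driven from Python. This is a rough sketch that assumes `protoc` is installed and on the PATH; the function name `decode_raw` is hypothetical, and the file name in the usage comment is the one from the command above.

```python
import shutil
import subprocess


def decode_raw(pb_path, protoc="protoc"):
    """Run `protoc --decode_raw` on a binary file and return the text output."""
    if shutil.which(protoc) is None:
        raise FileNotFoundError(f"{protoc!r} was not found on PATH")
    with open(pb_path, "rb") as pb_file:
        # protoc reads the serialized message from stdin
        result = subprocess.run(
            [protoc, "--decode_raw"],
            stdin=pb_file,
            capture_output=True,
            check=True,
        )
    return result.stdout.decode("utf-8")


# Usage (requires protoc and the downloaded file):
# text = decode_raw("ord_search_results.pb")
```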
Example of the decoded content:
1: "ORD Search Results"
3 {
  1 {
    1: 1
    2: "reaction index"
    3: "900"
  }
  ...
}
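The decoded text is line-oriented: `N {` opens a nested message, `}` closes it, and `N: value` is a scalar field. A stack-based parser can therefore rebuild the nesting. The sketch below is one way to do it; `parse_decode_raw` and `parse_scalar` are names I made up, the sample is the excerpt above without the elided part, and escape sequences inside quoted strings are not handled.

```python
def parse_scalar(token):
    """Turn a raw value token into a str or int where possible."""
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        return token[1:-1]          # quoted string: drop the quotes
    try:
        return int(token, 0)        # decimal or 0x-prefixed integer
    except ValueError:
        return token                # anything else stays as text


def parse_decode_raw(text):
    """Parse decode_raw-style text into nested lists of (field_number, value)."""
    root = []
    stack = [root]                  # stack[-1] is the message being filled
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.endswith("{"):      # "N {" opens a nested message
            child = []
            stack[-1].append((int(line[:-1]), child))
            stack.append(child)
        elif line == "}":           # close the current message
            stack.pop()
        else:                       # "N: value" scalar field
            field, _, value = line.partition(":")
            stack[-1].append((int(field), parse_scalar(value.strip())))
    return root


SAMPLE = '''
1: "ORD Search Results"
3 {
  1 {
    1: 1
    2: "reaction index"
    3: "900"
  }
}
'''

tree = parse_decode_raw(SAMPLE)
print(tree)
```

Fields are kept as a list of pairs rather than a dict because a repeated field shows up as the same field number several times, and a dict would silently keep only the last one.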
Example of the structured JSON obtained from the website:
{
    "identifiersList": [
        {
            "type": 1,
            "details": "reaction index",
            "value": "569",
            "isMapped": false
        },
        ...
    ],
    "conditions": {
        "temperature": {
            "setpoint": {
                "value": 100,
                "precision": 10,
                "units": 1
            }
        },
        "reflux": false,
        "ph": 0,
        "conditionsAreDynamic": false
    }
}
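Comparing the two samples side by side suggests how the numbered fields line up with the named ones: in an identifier block, field 1 looks like `type`, field 2 like `details`, and field 3 like `value`. The sketch below applies that guess. The mapping is inferred only from these two excerpts, not from the official schema, and `IDENTIFIER_FIELDS` and `name_identifier` are hypothetical names.

```python
import json

# Assumed mapping, inferred by comparing the decoded text with the website JSON
IDENTIFIER_FIELDS = {1: "type", 2: "details", 3: "value"}


def name_identifier(fields):
    """Rename (field_number, value) pairs of one identifier block."""
    named = {}
    for number, value in fields:
        # Unknown field numbers are kept under a placeholder key, not dropped
        named[IDENTIFIER_FIELDS.get(number, f"field_{number}")] = value
    # The website JSON also lists "isMapped": false, which has no counterpart
    # in the raw sample; it is filled in here as an assumed default.
    named.setdefault("isMapped", False)
    return named


# Values taken from the decoded text excerpt
identifier = name_identifier([(1, 1), (2, "reaction index"), (3, "900")])
print(json.dumps(identifier, indent=4))
```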
Use the following Python script to convert the text into structured JSON:
import re
import json


def parse_entry(entry):
    """Collect `N: "text"` pairs from one block, keyed by field number.

    Only quoted string values are captured; numeric fields such as `1: 1`
    are skipped. Helper, not called by the main script below.
    """
    parsed = {}
    key_value_pairs = re.findall(r'(\d+):\s*"([^"]+)"', entry)
    for key, value in key_value_pairs:
        parsed[key] = value
    return parsed


def parse_nested_structure(text):
    """Split the text before each `N {` opener and parse every piece.

    Helper, not called by the main script below.
    """
    result = []
    blocks = re.split(r'\s(?=\d+\s\{)', text.strip())
    for block in blocks:
        entry = block.strip('{}').strip()
        if entry:
            parsed_entry = parse_entry(entry)
            result.append(parsed_entry)
    return result


def convert_to_full_record(decoded_text):
    """Build a record shaped like the website JSON.

    NOTE: the decoded text is only used to detect which identifiers are
    present. All values below are hardcoded from the website's JSON example,
    not extracted from the decoded text.
    """
    identifiers = []
    # Splitting on a zero-width match requires Python 3.7+
    blocks = re.split(r'(?=\d+\s\{)', decoded_text.strip())
    for block in blocks:
        if "reaction index" in block:
            identifiers.append({
                "type": 1,
                "details": "reaction index",
                "value": "569",  # hardcoded
                "isMapped": False
            })
        elif "reaction type" in block:
            identifiers.append({
                "type": 5,
                "details": "reaction type",
                "value": "1.3.1 [N-arylation with Ar-X] Bromo Buchwald-Hartwig amination",  # hardcoded
                "isMapped": False
            })
    full_record = {
        "identifiersList": identifiers,
        "inputsMap": [
            ["Base", {
                "componentsList": [{
                    "identifiersList": [{
                        "type": 2,
                        "details": "",
                        "value": "CC(C)(C)[O-].[Na+]"
                    }],
                    "amount": {
                        "moles": {
                            "value": 0.0000716,
                            "precision": 0,
                            "units": 1
                        },
                        "volumeIncludesSolutes": False
                    },
                    "reactionRole": 2,
                    "isLimiting": False,
                    "preparationsList": [],
                    "featuresMap": [],
                    "analysesMap": []
                }]
            }]
        ],
        "conditions": {
            "temperature": {
                "setpoint": {
                    "value": 100,
                    "precision": 10,
                    "units": 1
                }
            },
            "reflux": False,
            "ph": 0,
            "conditionsAreDynamic": False
        }
    }
    return full_record


# Read the text produced by `protoc --decode_raw`
with open("decoded_output.txt", "r", encoding="utf-8") as file:
    decoded_data = file.read()

full_record_data = convert_to_full_record(decoded_data)

# Write the record as pretty-printed JSON
with open("structured_full_record.json", "w", encoding="utf-8") as json_file:
    json.dump(full_record_data, json_file, indent=4, ensure_ascii=False)

print("Full record JSON file has been created successfully.")
The structured JSON file is generated successfully:
{
    "identifiersList": [
        {
            "type": 1,
            "details": "reaction index",
            "value": "569",
            "isMapped": false
        },
        {
            "type": 5,
            "details": "reaction type",
            "value": "1.3.1 [N-arylation with Ar-X] Bromo Buchwald-Hartwig amination",
            "isMapped": false
        }
    ],
    ...
}