概述
TextRank(文本摘要)起源于PageRank,是基于图的文本处理排序模型,可应用于各种自然语言处理任务,如关键字提取、关键短语提取和文本摘要。
- R. Mihalcea, P. Tarau, TextRank: Bringing Order Into Texts (2004)
基本概念
转换文本为图集
应用TextRank算法时,需先将文本表示为图。图结构取决于具体应用场景:
- 点:将最适合文本处理任务的文本单元(如字、词或句)作为节点添加至图集中。
- 边:文本单元之间的关系,如语义相似度、共现或上下文重叠,是连接各点的边。

构建提取关键词语的示例图:节点是从文本中选择的词汇单元,边是根据定义的单词窗口里的共现关系建立的(来源:原论文)
TextRank模型
所有文本单元的排名由“推荐”机制递归计算获得,过程与PageRank算法类似。然而TextRank使用了改进后的公式,将边权重考虑在内:

其中,
- Out(v)代表点v指向的点集合;
- wvu代表点v和点u间的边权重;
- d是阻尼系数。
特殊说明
- 孤立文本单元的排名与(1 - d)的值相同。
- 自环既被视作后继,也被视为前驱,节点通过自环将排名传递给自身。如果网络中有许多自环,算法需要更多次迭代才能收敛。
示例图

创建示例图集:
ALTER EDGE default ADD PROPERTY {
  weight int32
};
INSERT (A:default {_id: "A"}),
       (B:default {_id: "B"}),
       (C:default {_id: "C"}),
       (D:default {_id: "D"}),
       (E:default {_id: "E"}),
       (F:default {_id: "F"}),
       (G:default {_id: "G"}),       
       (A)-[:default {weight: 3}]->(E),
       (B)-[:default {weight: 3}]->(A),
       (B)-[:default {weight: 2}]->(E),
       (C)-[:default {weight: 1}]->(A),
       (C)-[:default {weight: 4}]->(D),
       (D)-[:default {weight: 5}]->(E),
       (E)-[:default {weight: 2}]->(G),
       (F)-[:default {weight: 1}]->(B),
       (F)-[:default {weight: 3}]->(G);
create().edge_property(@default, "weight", int32);
insert().into(@default).nodes([{_id:"A"}, {_id:"B"}, {_id:"C"}, {_id:"D"}, {_id:"E"}, {_id:"F"}, {_id:"G"}]);
insert().into(@default).edges([{_from:"A", _to:"E", weight:3}, {_from:"B", _to:"A", weight:3}, {_from:"B", _to:"E", weight:2}, {_from:"C", _to:"A", weight:1}, {_from:"C", _to:"D", weight:4}, {_from:"D", _to:"E", weight:5}, {_from:"E", _to:"G", weight:2}, {_from:"F", _to:"B", weight:1}, {_from:"F", _to:"G", weight:3}]);
创建HDC图
将当前图集全部加载到HDC服务器hdc-server-1上,并命名为 my_hdc_graph:
CREATE HDC GRAPH my_hdc_graph ON "hdc-server-1" OPTIONS {
  nodes: {"*": ["*"]},
  edges: {"*": ["*"]},
  direction: "undirected",
  load_id: true,
  update: "static"
}
hdc.graph.create("my_hdc_graph", {
  nodes: {"*": ["*"]},
  edges: {"*": ["*"]},
  direction: "undirected",
  load_id: true,
  update: "static"
}).to("hdc-server-1")
参数
算法名:text_rank
| 参数名 | 类型 | 规范 | 默认值 | 可选 | 描述 | 
|---|---|---|---|---|---|
| init_value | Float | >0 | 0.2 | 是 | 所有点的初始排名 | 
| loop_num | Integer | ≥1 | 5 | 是 | 最大迭代轮数。算法将在完成所有轮次后停止 | 
| damping | Float | (0,1) | 0.8 | 是 | 阻尼系数 | 
| max_change | Float | ≥0 | 0 | 是 | 某轮迭代后,若所有点的排名变化小于指定 max_change时,表明结果已稳定,算法会停止。设置为0时停用此标准 | 
| edge_schema_property | []" <@schema.?><property>" | / | / | 否 | 作为权重的数值类型边属性,权重值为所有指定属性值的总和;不包含指定属性的边将被忽略 | 
| return_id_uuid | String | uuid,id,both | uuid | 是 | 在结果中使用 _uuid、_id或同时使用两者来表示点 | 
| limit | Integer | ≥-1 | -1 | 是 | 限制返回的结果数; -1返回所有结果 | 
| order | String | asc,desc | / | 是 | 根据 rank分值对结果排序 | 
文件回写
CALL algo.text_rank.write("my_hdc_graph", {
  return_id_uuid: "id",
  init_value: 1,
  loop_num: 50,
  damping: 0.8,
  edge_schema_property: "weight",
  order: "desc"
}, {
  file: {
    filename: "textrank"
  }
})
algo(text_rank).params({
  projection: "my_hdc_graph",
  return_id_uuid: "id",
  init_value: 1,
  loop_num: 50,
  damping: 0.8,
  edge_schema_property: "weight",
  order: 'desc'
}).write({
  file: {
    filename: "textrank"
  }
})
结果:
_id,text_rank
G,0.973568
E,0.81696
A,0.3472
D,0.328
B,0.24
F,0.2
C,0.2
数据库回写
将结果中的text_rank值写入指定点属性。该属性类型为double。
CALL algo.text_rank.write("my_hdc_graph", {
  loop_num: 50,
  edge_schema_property: "@default.weight"
}, {
  db: {
    property: "rank"
  }
})
algo(text_rank).params({
  projection: "my_hdc_graph",
  loop_num: 50,
  edge_schema_property: "@default.weight"
}).write({
  db:{ 
    property: "rank"
  }
})
完整返回
CALL algo.text_rank.run("my_hdc_graph", {
  return_id_uuid: "id",    
  init_value: 1,
  loop_num: 50,
  damping: 0.8,
  edge_schema_property: "weight",
  order: "desc",
  limit: 5
}) YIELD TR
RETURN TR
exec{
  algo(text_rank).params({
    return_id_uuid: "id",    
    init_value: 1,
    loop_num: 50,
    damping: 0.8,
    edge_schema_property: "weight",
    order: "desc",
    limit: 5
  }) as TR
  return TR
} on my_hdc_graph
结果:
| _id | text_rank | 
|---|---|
| G | 0.973568 | 
| E | 0.81696 | 
| A | 0.3472 | 
| D | 0.328 | 
| B | 0.24 | 
流式返回
CALL algo.text_rank.stream("my_hdc_graph", {
  return_id_uuid: "id",
  loop_num: 50,
  damping: 0.8,
  edge_schema_property: "weight",
  order: "desc",
  limit: 5
}) YIELD TR
RETURN TR
exec{
  algo(text_rank).params({
    return_id_uuid: "id",
    loop_num: 50,
    damping: 0.8,
    edge_schema_property: "weight",
    order: "desc",
    limit: 5
  }).stream() as TR
  return TR
} on my_hdc_graph
结果:
| _id | text_rank | 
|---|---|
| G | 0.973568 | 
| E | 0.81696 | 
| A | 0.3472 | 
| D | 0.328 | 
| B | 0.24 | 
