NEO4J

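Note: to use the Neo4j algorithm library, install the algorithm library plugin that matches your Neo4j database version, or provide a custom algorithm library.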

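1. Introduction

The overlap similarity algorithm first represents the two vectors as two one-dimensional coordinates of equal length, that is, it maps them into a one-dimensional space, and then computes a weighted sum of their overlap. It considers neither the angle between the two vectors nor the length of their difference.

In set form, which matches the source code in section 4, the formula is:

$O(A,B) = \frac{|A \cap B|}{\min(|A|,|B|)}$

In the element-wise form of the formula, the denominator acts as a normalization factor, one term serves as the common-dimension function, and $O(x_{i},y_{i})$ is the overlap function.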
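The coefficient implemented by the source code in section 4 divides the number of shared elements by the size of the smaller collection. The following is a minimal sketch of that idea on plain Java collections. The class and method names (`OverlapSketch`, `overlap`) are hypothetical and not part of the Neo4j library, and the sketch deduplicates both inputs before counting, which is an assumption of this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative helper (hypothetical name), not part of the Neo4j library. */
final class OverlapSketch {

    /** Returns |A ∩ B| / min(|A|, |B|), or 0 when either set is empty. */
    static double overlap(List<Integer> a, List<Integer> b) {
        Set<Integer> setA = new HashSet<>(a);
        Set<Integer> setB = new HashSet<>(b);
        int denominator = Math.min(setA.size(), setB.size());
        if (denominator == 0) {
            return 0.0;
        }
        Set<Integer> shared = new HashSet<>(setA);
        shared.retainAll(setB); // keep only the elements present in both sets
        return (double) shared.size() / denominator;
    }

    public static void main(String[] args) {
        // {1,2} is a subset of {1,2,3,4}, so the coefficient is 1.0
        System.out.println(overlap(Arrays.asList(1, 2), Arrays.asList(1, 2, 3, 4)));
        // one shared element out of min(2, 3) = 2 gives 0.5
        System.out.println(overlap(Arrays.asList(1, 2), Arrays.asList(2, 3, 4)));
    }
}
```

A value of 1.0 means the smaller collection is fully contained in the larger one, which is why the measure is useful for detecting subsets.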

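2. Use cases

We can use the overlap similarity algorithm to work out which things are subsets of other things. The algorithm does not require the two things being compared to have the same number of associated items. We might then use the computed subsets to learn a taxonomy from tagged data, for example in common data mining and analysis workflows.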

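3. Examples of the overlap functions in Neo4j

Neo4j provides the following function and procedures. As the source code analysis in section 4 shows, the function is suited to comparing two items, while the procedures are suited to comparing many items with each other.

  • the algo.similarity.overlap function, which takes two List<Number> arguments
  • the algo.similarity.overlap.stream procedure, which takes (List<Map<String,Object>> data,
    Map<String, Object> config)
  • the algo.similarity.overlap procedure, which takes (List<Map<String,Object>> data,
    Map<String, Object> config)

1. Compute the overlap similarity of two hard-coded lists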

RETURN algo.similarity.overlap([1,2,3], [1,2,4,5]) AS similarity

Result: 0.6666666666666666
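As a check on this value, the arithmetic can be written out by hand. This assumes the function counts the distinct shared elements and divides by the length of the shorter list, as the source in section 4 does:

```latex
\[
O(\{1,2,3\},\{1,2,4,5\})
  = \frac{|\{1,2,3\} \cap \{1,2,4,5\}|}{\min(3,\,4)}
  = \frac{|\{1,2\}|}{3}
  = \frac{2}{3} \approx 0.6667
\]
```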

2. Initialize the node data

MERGE (fahrenheit451:Book {title:'Fahrenheit 451'})
MERGE (dune:Book {title:'Dune'})
MERGE (hungerGames:Book {title:'The Hunger Games'})
MERGE (nineteen84:Book {title:'1984'})
MERGE (gatsby:Book {title:'The Great Gatsby'})

MERGE (scienceFiction:Genre {name: "Science Fiction"})
MERGE (fantasy:Genre {name: "Fantasy"})
MERGE (dystopia:Genre {name: "Dystopia"})
MERGE (classics:Genre {name: "Classics"})
// romance is referenced below; without this line it would be created as an unlabeled, unnamed node
MERGE (romance:Genre {name: "Romance"})

MERGE (fahrenheit451)-[:HAS_GENRE]->(dystopia)
MERGE (fahrenheit451)-[:HAS_GENRE]->(scienceFiction)
MERGE (fahrenheit451)-[:HAS_GENRE]->(fantasy)
MERGE (fahrenheit451)-[:HAS_GENRE]->(classics)

MERGE (hungerGames)-[:HAS_GENRE]->(scienceFiction)
MERGE (hungerGames)-[:HAS_GENRE]->(fantasy)
MERGE (hungerGames)-[:HAS_GENRE]->(romance)

MERGE (nineteen84)-[:HAS_GENRE]->(scienceFiction)
MERGE (nineteen84)-[:HAS_GENRE]->(dystopia)
MERGE (nineteen84)-[:HAS_GENRE]->(classics)

MERGE (dune)-[:HAS_GENRE]->(scienceFiction)
MERGE (dune)-[:HAS_GENRE]->(fantasy)
MERGE (dune)-[:HAS_GENRE]->(classics)

MERGE (gatsby)-[:HAS_GENRE]->(classics)

3. Compute the intersection and overlap similarity between nodes

MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), categories: collect(id(book))} as userData
WITH collect(userData) as data
CALL algo.similarity.overlap.stream(data)
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to,count1, count2, intersection, similarity
ORDER BY similarity DESC

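Result: count1 is the number of nodes related to from, count2 is the number of nodes related to to, intersection is the number of nodes they share, and similarity is the overlap similarity.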
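To make these columns concrete, the sketch below recomputes them in plain Java from the genre memberships created in step 2. It is an illustration only: the class and method names are hypothetical, it does not call Neo4j, it lists every unordered pair in an arbitrary order, and the procedure's own row order and direction may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch with hypothetical names; it does not call Neo4j. */
final class GenreOverlapSketch {

    /** Genre -> books, transcribed from the MERGE statements in step 2. */
    static Map<String, Set<String>> sampleData() {
        Map<String, Set<String>> genres = new LinkedHashMap<>();
        genres.put("Dystopia", books("Fahrenheit 451", "1984"));
        genres.put("Science Fiction", books("Fahrenheit 451", "The Hunger Games", "1984", "Dune"));
        genres.put("Fantasy", books("Fahrenheit 451", "The Hunger Games", "Dune"));
        genres.put("Classics", books("Fahrenheit 451", "1984", "Dune", "The Great Gatsby"));
        genres.put("Romance", books("The Hunger Games"));
        return genres;
    }

    static Set<String> books(String... titles) {
        return new HashSet<>(Arrays.asList(titles));
    }

    /** Number of books that two genres share. */
    static int intersection(Set<String> a, Set<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return shared.size();
    }

    /** Shared books divided by the smaller of the two book counts. */
    static double overlap(Set<String> a, Set<String> b) {
        int denominator = Math.min(a.size(), b.size());
        return denominator == 0 ? 0.0 : (double) intersection(a, b) / denominator;
    }

    /** One formatted row per unordered pair of genres. */
    static List<String> pairs(Map<String, Set<String>> genres) {
        List<String> names = new ArrayList<>(genres.keySet());
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                Set<String> a = genres.get(names.get(i));
                Set<String> b = genres.get(names.get(j));
                rows.add(String.format(Locale.ROOT, "%s | %s | %d | %d | %d | %.4f",
                        names.get(i), names.get(j), a.size(), b.size(),
                        intersection(a, b), overlap(a, b)));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println("from | to | count1 | count2 | intersection | similarity");
        pairs(sampleData()).forEach(System.out::println);
    }
}
```

For instance, Dystopia has two books and both are also Science Fiction, so that pair has count1 = 2, count2 = 4, intersection = 2 and similarity 1.0, marking Dystopia as a subset of Science Fiction in this data set.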

4. Filter the similarity results, keeping only pairs whose similarity is greater than or equal to 0.75

MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), categories: collect(id(book))} as userData
WITH collect(userData) as data
CALL algo.similarity.overlap.stream(data, {similarityCutoff: 0.75})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to,count1, count2, intersection, similarity
ORDER BY similarity DESC
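The effect of similarityCutoff can be pictured as a simple filter over scored pairs. The sketch below applies a cutoff of 0.75 to a few scores worked out by hand from the step 2 data; they are not copied from Neo4j output. The names `CutoffSketch`, `applyCutoff` and `handComputedScores` are hypothetical, and treating the cutoff as inclusive follows the "greater than or equal to" wording of this step.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative sketch with hypothetical names; it does not call Neo4j. */
final class CutoffSketch {

    /** Keeps only the pairs whose score is greater than or equal to the cutoff. */
    static Map<String, Double> applyCutoff(Map<String, Double> scores, double cutoff) {
        return scores.entrySet().stream()
                .filter(entry -> entry.getValue() >= cutoff)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (first, second) -> first, LinkedHashMap::new));
    }

    /** Overlap scores for some genre pairs, computed by hand from the step 2 data. */
    static Map<String, Double> handComputedScores() {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("Dystopia / Science Fiction", 1.0);
        scores.put("Dystopia / Classics", 1.0);
        scores.put("Fantasy / Science Fiction", 1.0);
        scores.put("Science Fiction / Classics", 0.75);
        scores.put("Fantasy / Classics", 2.0 / 3.0);
        scores.put("Dystopia / Fantasy", 0.5);
        return scores;
    }

    public static void main(String[] args) {
        // Pairs scoring below 0.75 are dropped; a score of exactly 0.75 is kept.
        applyCutoff(handComputedScores(), 0.75)
                .forEach((pair, score) -> System.out.println(pair + " = " + score));
    }
}
```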

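5. For each node, find its most similar nodes and store the relationships between them. According to the source in section 4, the relationship type defaults to NARROWER_THAN, the property defaults to score, and results are written only when write is true and similarityCutoff is greater than 0.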

MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), categories: collect(id(book))} as userData
WITH collect(userData) as data
CALL algo.similarity.overlap(data, {topK: 2, similarityCutoff: 0.5, write:true})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95

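6. Specify source and target IDs: sometimes we do not want to compute the similarity of all pairs, and would rather specify subsets of items to compare with each other. We do this with the sourceIds and targetIds keys in the config.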

MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), name: genre.name, categories: collect(id(book))} as userData
WITH collect(userData) as data
WITH data,[value in data WHERE value.name IN ["Fantasy", "Classics"] | value.item ] AS sourceIds
CALL algo.similarity.overlap.stream(data, {sourceIds: sourceIds})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
ORDER BY similarity DESC

4. Source code analysis

The algo.similarity.overlap function:

    public double overlapSimilarity(@Name("vector1") List<Number> vector1,
                                    @Name("vector2") List<Number> vector2) {
        if (vector1 == null || vector2 == null) return 0;

        // Distinct elements of vector1 that also occur in vector2
        HashSet<Number> intersectionSet = new HashSet<>(vector1);
        intersectionSet.retainAll(vector2);
        int intersection = intersectionSet.size();

        // Normalize by the length of the shorter input list
        long denominator = Math.min(vector1.size(), vector2.size());
        return denominator == 0 ? 0 : (double) intersection / denominator;
    }
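One detail of the overlapSimilarity function is worth trying out: the intersection is counted on a HashSet, so duplicates are removed, while the denominator uses the raw list sizes. The sketch below copies the function body into a standalone class without the Neo4j `@Name` annotations so that it runs without Neo4j. The class name `OverlapFunctionSketch` and the sample inputs are made up for this illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Standalone copy of the function's logic for experimentation (hypothetical class name). */
final class OverlapFunctionSketch {

    static double overlapSimilarity(List<Number> vector1, List<Number> vector2) {
        if (vector1 == null || vector2 == null) return 0;
        HashSet<Number> intersectionSet = new HashSet<>(vector1);
        intersectionSet.retainAll(vector2);
        int intersection = intersectionSet.size();
        long denominator = Math.min(vector1.size(), vector2.size());
        return denominator == 0 ? 0 : (double) intersection / denominator;
    }

    public static void main(String[] args) {
        // Same inputs as the hard-coded example: 2 shared values / min(3, 4)
        System.out.println(overlapSimilarity(
                Arrays.<Number>asList(1L, 2L, 3L), Arrays.<Number>asList(1L, 2L, 4L, 5L)));

        // Duplicates: the intersection is {1, 2} (size 2), but the denominator
        // is the list length 3, so the result is 2/3 even though every distinct
        // value of the first list occurs in the second.
        System.out.println(overlapSimilarity(
                Arrays.<Number>asList(1L, 1L, 2L), Arrays.<Number>asList(1L, 2L, 3L)));

        // A null input short-circuits to 0
        System.out.println(overlapSimilarity(null, Arrays.<Number>asList(1L)));
    }
}
```

If inputs may contain repeated values, deduplicating them before calling the function keeps the result aligned with the set-based definition.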

The algo.similarity.overlap procedure:

    public Stream<SimilaritySummaryResult> overlap(
            @Name(value = "data", defaultValue = "null") List<Map<String, Object>> data,
            @Name(value = "config", defaultValue = "{}") Map<String, Object> config) {
        ProcedureConfiguration configuration = ProcedureConfiguration.create(config);
        CategoricalInput[] inputs = prepareCategories(data, getDegreeCutoff(configuration));

        String writeRelationshipType = configuration.get("writeRelationshipType", "NARROWER_THAN");
        String writeProperty = configuration.getWriteProperty("score");
        if (inputs.length == 0) {
            return emptyStream(writeRelationshipType, writeProperty);
        }

        long[] inputIds = SimilarityInput.extractInputIds(inputs);
        int[] sourceIndexIds = indexesFor(inputIds, configuration, "sourceIds");
        int[] targetIndexIds = indexesFor(inputIds, configuration, "targetIds");
        SimilarityComputer<CategoricalInput> computer = similarityComputer(sourceIndexIds, targetIndexIds);
        SimilarityRecorder<CategoricalInput> recorder = categoricalSimilarityRecorder(computer, configuration);

        double similarityCutoff = getSimilarityCutoff(configuration);
        Stream<SimilarityResult> stream = topN(
                similarityStream(inputs, sourceIndexIds, targetIndexIds, recorder, configuration,
                        () -> null, similarityCutoff, getTopK(configuration)),
                getTopN(configuration));

        // Results are written back only when write is requested and the cutoff is positive
        boolean write = configuration.isWriteFlag(false) && similarityCutoff > 0.0;
        return writeAndAggregateResults(stream, inputs.length, sourceIndexIds.length,
                targetIndexIds.length, configuration, write, writeRelationshipType, writeProperty, recorder);
    }

The algo.similarity.overlap.stream procedure:

    public Stream<SimilarityResult> similarityStream(
            @Name(value = "data", defaultValue = "null") List<Map<String, Object>> data,
            @Name(value = "config", defaultValue = "{}") Map<String, Object> config) {
        ProcedureConfiguration configuration = ProcedureConfiguration.create(config);
        CategoricalInput[] inputs = prepareCategories(data, getDegreeCutoff(configuration));
        if (inputs.length == 0) {
            return Stream.empty();
        }

        long[] inputIds = SimilarityInput.extractInputIds(inputs);
        int[] sourceIndexIds = indexesFor(inputIds, configuration, "sourceIds");
        int[] targetIndexIds = indexesFor(inputIds, configuration, "targetIds");
        SimilarityComputer<CategoricalInput> computer = similarityComputer(sourceIndexIds, targetIndexIds);

        return topN(
                similarityStream(inputs, sourceIndexIds, targetIndexIds, computer, configuration,
                        () -> null, getSimilarityCutoff(configuration), getTopK(configuration)),
                getTopN(configuration));
    }
