Massive Data Processing/Design and Implementation of Big Data Applications/Processing Tweet Streams

The Geeser Project proposes several stages. Each stage adds a different activity that works on the stream in parallel: for instance, word counting and trending topics in the second stage, and entity disambiguation in the third.
 
For each stage, I propose to write the set of spouts and bolts needed to reach the corresponding objective. That way, developers can assemble a topology according to their needs:
 
# Stage 1: Basic Spouts and Raw Tweet textual processing
# Stage 2: Word Counting and Trending Topics Bolts
# Stage 3: Entity Disambiguation Bolt
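
As an illustration of the kind of logic a Stage 2 bolt would wrap, here is a minimal sketch of word counting and a simple "trending" ranking in plain Java, without any Storm dependency. The class <code>WordCounter</code> and its methods are hypothetical names chosen for this example and are not part of the Geeser code base; tokenizing on whitespace and ranking by raw frequency are simplifying assumptions.

<syntaxhighlight lang="java">
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Geeser project.
class WordCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    // Count every whitespace-separated token of one tweet, ignoring case.
    void add(String tweet) {
        for (String token : tweet.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    // Number of times a (lower-cased) word has been seen so far.
    int count(String word) {
        return counts.getOrDefault(word, 0);
    }

    // The k most frequent words; ties are broken alphabetically.
    List<String> top(int k) {
        List<String> words = new ArrayList<>(counts.keySet());
        words.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(k, words.size())));
    }
}
</syntaxhighlight>

In a topology, a bolt would call something like <code>add</code> for each incoming tuple and periodically emit the result of <code>top</code>.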
 
===Communication patterns===
 
==== Topology Builder ====
The following example shows how the topology is built:
<syntaxhighlight lang="java">
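// Imports assume Storm 1.0 or later; earlier releases use the backtype.storm package instead.
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

// The class name is illustrative; the spout and bolt classes belong to the project.
public class TweetProcessingTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // One spout task reads tweets from a file.
        builder.setSpout("spout", new TwitterFileSpout(), 1);
        // Both the "project" and "filter" stages use FilterTweet here.
        builder.setBolt("project", new FilterTweet(), 5)
               .shuffleGrouping("spout");
        builder.setBolt("filter", new FilterTweet(), 5)
               .shuffleGrouping("project");

        // A single printer task collects the output.
        builder.setBolt("print", new PrinterBolt(), 1)
               .shuffleGrouping("filter");

        Config conf = new Config();
        conf.setDebug(false);
        conf.setNumWorkers(3);

        if (args != null && args.length > 0) {
            // Submit to a remote cluster under the name given on the command line.
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            // Run in local mode for testing.
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
            // Let the topology run before shutting down; the duration is arbitrary.
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }
}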
</syntaxhighlight>
====Communication====
The communication process is completely abstracted from the developer. We only need to minimize the consumption of network bandwidth, since Storm does not manage it well.
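
One way to reduce the amount of data sent between components is to drop unneeded fields as early as possible, so that later stages receive smaller tuples. The sketch below shows the idea in plain Java, without any Storm dependency. The class <code>TweetProjection</code>, the method <code>project</code>, and the field names used in the usage note are assumptions made for this example; they are not taken from the project code.

<syntaxhighlight lang="java">
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Geeser project.
class TweetProjection {
    // Keep only the requested fields, in the requested order.
    // Fields that are absent from the tweet are skipped.
    static Map<String, Object> project(Map<String, Object> tweet, List<String> fields) {
        Map<String, Object> slim = new LinkedHashMap<>();
        for (String field : fields) {
            if (tweet.containsKey(field)) {
                slim.put(field, tweet.get(field));
            }
        }
        return slim;
    }
}
</syntaxhighlight>

For example, calling <code>project</code> with the fields <code>"id"</code> and <code>"text"</code> on a full tweet record would forward only those two values to the next stage.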
 
===Nimbus Logs===
The log below shows the final output of the proposed topology. It processes raw tweets and returns trigrams; the steps include removing hashtags, collapsing character repetitions, removing links, and generating the trigrams, among other details.
[[File:FinalLog.png|centro|Tweet Processing Topology log]]
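
The cleaning steps described above can be sketched in plain Java as follows. This is only an approximation under stated assumptions, not the code of the project's bolts: the class <code>TweetCleaner</code> and its methods are hypothetical, trigrams are assumed to be sequences of three consecutive words, repeated letters are collapsed to two, and hashtags are matched with the ASCII-only <code>\w</code> class.

<syntaxhighlight lang="java">
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Geeser project.
class TweetCleaner {
    // Remove links and hashtags, collapse runs of three or more identical
    // letters down to two, and normalize whitespace.
    static String clean(String tweet) {
        return tweet
            .replaceAll("https?://\\S+", " ")        // links
            .replaceAll("#\\w+", " ")                // hashtags
            .replaceAll("(\\p{L})\\1{2,}", "$1$1")   // "looooove" becomes "loove"
            .replaceAll("\\s+", " ")
            .trim();
    }

    // Build word trigrams from text whose words are separated by single spaces.
    static List<String> trigrams(String text) {
        List<String> result = new ArrayList<>();
        String[] words = text.split(" ");
        for (int i = 0; i + 2 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1] + " " + words[i + 2]);
        }
        return result;
    }
}
</syntaxhighlight>

A text with fewer than three words yields an empty list, so short tweets simply produce no trigrams.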
 
==Conclusion==
This project shows that it is possible to create a complex distributed system for processing massive tweet streams. The system is scalable and very flexible. Topologies can be modified for several different purposes, which makes it well suited to the Web Observatory and to Data Science research projects. Although speedup gains may not be applicable (since measuring speedup presumes that the data can be processed serially on a single computer), there is an enormous gain in scalability.
 
The main contribution here is to assess the viability of the Storm framework and to bring knowledge of it to the Web Observatory. The framework not only helps software development but also makes it possible to deal with massive streams. The gain for this project is immeasurable in many respects.