[Soc-coordination] [GSoC 2014] Metrics Portal project

Nikolay Baluk kolyaflash at gmail.com
Sat Mar 8 16:59:18 UTC 2014


Hi, Stefano, Stuart and the Debian community!


I don’t know how things are going with the “clamouring hordes wanting to pick
the Debian Metrics Portal project”, but this project is definitely worth it :)

I have been thinking about it for a couple of days, and now I have some ideas
about its details, as well as some parts that are still unclear to me.

Prelude:

Most modern data visualization software gathers data either with an
agent-server architecture (Munin, New Relic) or through some entry point for
data (Graphite, RRDtool), local or via the web.

The choice of approach depends on the nature of the data. For example, the
first option is great for gathering information about the server
environment (load state, performance), since it provides a plugin-based
agent, while the second is better for getting information from third
parties, because it actually exposes an API.

In our case, we are faced with very different kinds of data. As I understand
it, we need to be able to show both "normal" data from a database, where each
record is a time-value point, and irregular data from some file, such as the
number of GSoC proposals per period or a pie chart of the usage of different
Debian versions.

Besides, this data comes in different formats - it is not homogeneous.
RRDtool, for example, clearly defines what is to be received (in some cases
with specified intervals), which allows it to easily handle incoming data and
later apply filters and customize the display according to the configuration.

The task is complicated by the fact that not everyone who already publishes
stats on https://wiki.debian.org/Statistics will be willing to provide data
in the Debian Metrics Portal (DMP) format or to convert to it. In other cases
it is simply not realistic: for example, for a disk I/O metric the user would
not only have to send the data to DMP, he would also have to get it from
somewhere in the first place (in the sense that we would be replacing Munin
with DMP).

On the other hand, data sources such as an RDBMS are a little easier to
handle. They already have a schema, so we can just let the user specify which
column is the timestamp and which is the value, or even detect it ourselves.


The main types of data sources for charts (from
https://wiki.debian.org/Statistics):

- RRDtool storage
  Example:
  https://ftp-master.debian.org/stat.html

- Flexible (meaning it is not difficult for the owner of the graphs to
  switch to our specific format)
  Examples:
  https://bugs.debian.org/release-critical/
  https://buildd.debian.org/stats/
  http://davesteele.github.io/debian-rfs-stats/

- RDBMS
  Example:
  UDD
  and I guess this is also from a DB:
  http://asdfasdf.debian.net/~tar/bugstats/

- Plain text
  Example:
  http://qa.debian.org/watch/uscan-status-stats.txt

- Undefined: sources that may only be available in a hard-to-parse format,
  like an HTML page.
  Example:
  http://ircbots.debian.net/factoids/stats.php?q=recently-created


I would suggest:

1. Architecture. The agent-server model is not optimal, since we will not
always have the opportunity, or the need, to run an agent on the server; in
most cases this approach would be redundant (e.g. UDD).

The data-entry-point approach also has some trouble spots, but they will be
much easier to solve.


So, we will provide local and network entry points (DMP API).

Example use cases (actually DMP API wrappers):

Console:

Create a simple shell script.

$ dmp-client add <metric_id> <value> <timestamp>

Web:

Run a lightweight web server on a non-80 port for web interaction.

$.ajax({
  type: "POST",
  url: "http://192.168.0.100:4242/dmp-rpc/",
  data: {
    "metric_id": "website_visit",
    "value": 1,
    "timestamp": "auto",
  },
});

Remote:

The trivial way in most software is an open socket, available both locally and remotely:

echo "users_online 42 <timestamp>" | nc -q0 192.168.0.100 4241
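To illustrate, here is a minimal sketch of what the DMP side of that socket
protocol could look like. Only the line format "<metric_id> <value>
<timestamp>" and port 4241 come from the example above; the function and
class names are my assumptions, not an existing interface:

```python
import socketserver

def parse_line(line):
    """Parse a '<metric_id> <value> <timestamp>' line into a point dict.

    The whitespace-separated line format follows the nc example above;
    the dict keys are assumptions."""
    metric_id, value, timestamp = line.split()
    return {"metric_id": metric_id,
            "value": float(value),
            "timestamp": timestamp}

class DMPHandler(socketserver.StreamRequestHandler):
    """Accepts one metric point per line on an open TCP connection."""
    def handle(self):
        for raw in self.rfile:
            point = parse_line(raw.decode())
            # In a real DMP this point would be handed to the storage backend.
            print("received:", point)
```

The handler would then be served with something like
socketserver.TCPServer(("0.0.0.0", 4241), DMPHandler).serve_forever().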

2. Allow multiple display formats (adjusted for certain metrics, depending
on the data type): charts, diagrams, etc.

3. In addition to the DMP API, collect data as follows:

In the admin web interface the user sets a "path" to the data: it could be a
plain text file (via FTP, HTTP, or locally), an RRD database, a MySQL
database, etc. With the given path we try to fetch that data and ask the
user, in our super-web-gui-import-tool, to specify how we should treat each
"column" (let's say, separated by ",") - e.g. as a timestamp or a value. Or
we do it ourselves if we know the format (like RRD). This is similar to how
importing products from any .xls table works in my software.
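As a sketch, the column-mapping step for plain-text sources could boil down
to something like this (the mapping format and the function name are my
assumptions, not an existing DMP interface):

```python
import csv
import io

def import_plain_text(text, mapping, delimiter=","):
    """Convert delimiter-separated text into time-value points.

    `mapping` is the user's choice from the import tool, e.g.
    {"timestamp": 0, "value": 2} meaning column 0 holds the timestamp
    and column 2 holds the value."""
    points = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        points.append({
            "timestamp": row[mapping["timestamp"]],
            "value": float(row[mapping["value"]]),
        })
    return points
```

For example, import_plain_text("2014-03-08,foo,42\n", {"timestamp": 0,
"value": 2}) would yield one point with timestamp "2014-03-08" and value 42.0.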

Next, the user sets the frequency of data synchronization and enjoys the
graphs in DMP and in his own system simultaneously. This also allows us to
get data from things like Munin without any problems and with no need to
change anything in the existing system.

Later, the user may move to the DMP API method, which is better, without
losing data.

This raises the question of data duplication.

For example, this method is applicable to the UDD source. However, copying
all the data from the remote database into DMP is excessive. But fetching
data directly from UDD every time is not flexible either.

4. Provide support for real-time graphs for metrics such as "online right
now", which do not require storing a lot of data, since only the current
state at the current moment needs to be analyzed.
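For such metrics, a small in-memory ring buffer instead of persistent storage
would be enough; a sketch (all names here are assumptions):

```python
from collections import deque

class RealtimeMetric:
    """Keeps only the most recent points of a real-time metric;
    older points are discarded automatically by the bounded deque."""

    def __init__(self, maxlen=300):
        self.points = deque(maxlen=maxlen)

    def add(self, value, timestamp):
        self.points.append((timestamp, value))

    def current(self):
        """Return the latest value, or None if no data has arrived yet."""
        return self.points[-1][1] if self.points else None
```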

5. The DMP API is not only an entry point for sending data; it is also a way
of querying it.
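For instance, the same RPC endpoint from the web example above could serve
queries as well. A sketch; the query parameters and the endpoint itself are
hypothetical, only the host, port, and path come from the earlier example:

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the $.ajax example above.
BASE_URL = "http://192.168.0.100:4242/dmp-rpc/"

def build_query_url(metric_id, since, until, base_url=BASE_URL):
    """Build a query URL for one metric over a time range.

    The parameter names mirror the submit examples but are assumptions."""
    params = urllib.parse.urlencode(
        {"metric_id": metric_id, "since": since, "until": until})
    return base_url + "?" + params

def query_metric(metric_id, since, until):
    """Fetch the metric's points as JSON from the hypothetical endpoint."""
    with urllib.request.urlopen(build_query_url(metric_id, since, until)) as resp:
        return json.load(resp)
```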

6. Maybe provide an event-collector API for event-based data, so that DMP
could also become an open-source alternative to Google Analytics-like
software.

So, all of this should cover all the data-source types, with more or less
code to write. But in some cases the owner will still have to switch to the
new format.

-

In short, my main uncertainty is how to receive/gather the data. How to add
metrics in the admin panel, store them, and display them - I have done that
before, and I have some ideas about it.

Some feedback would help me begin work on a prototype (not just a script
that uses matplotlib to graph some metrics, but a web app as a proof of
concept). I think this project has the potential to become a great
open-source solution that helps Debian and other projects make their
software better by analyzing their stats.

-- 
Regards,
Nikolay.