[Teammetrics-discuss] Idea for improving performance.

Vipin Nair swvist at gmail.com
Fri Jun 29 07:51:00 UTC 2012


Hi Sukhbir/Andreas,

I have got an idea. I think there is a much better way to do things
than what I am doing currently.

Currently we have a script that parses data from a text file and
produces the graphs. The data in the text file is generated by
processing the data in the database. This needs to be done only once a
month since there are no changes in the database during the month.

The way I am doing it, each time someone requests some data over the
API, I am doing the all the processing once again. I can avoid this by
reading the data from the text file (like I did in my prototype) but
reading from files is slow and filtering specific data from a text
file would be a pain.

What I propose is that we maintain another database that contains the
data from the text files. So we can add a script that updates the data
in the new database and run it with other scripts each month. So
instead of the heavy processing (some sql queries take upto 9ms to
fetch the data) every time, all data access reduces to a simple select
query and joins are completely avoided. This makes the site faster and
reduced the load on the server drastically.

Instead of storing all this processed data in Postgres, this data can
be stored in a nosql database and all our data retrieval cost can be
made to O(1). Irrespective of the choice of the database, having a 2nd
database will improve the performance dramatically.

Why Nosql database?
1) Our data is essentially read-only(from users perspective)
2) No data joins required.

Advantages of NoSql database
1) Fast (Very fast)
2) Efficient (All queries will be O(1) in our use case)

Which NoSql database?
We have multiple options here, I'll pick the once I am comfortable with.

  1) CouchDB
 - CouchDB stores all the data in JSON format so we can directly serve
the data without any serialization.
 - Client side Javascript can directly access CouchDB so intermediate
processing(in python) can be avoided.
 - Getting the security setting right is a pain but keeping a web
server in front helps a lot.
 - I have good experience with CouchDb and have used it in a
production quality app.

 2) Redis - Ideal database for our use case.
 - In memory database but persistant, so data access times are reduced greatly.
 - Version is Squeeze is old, but I am not sure if it affects us.
 - I have used it few times but never in a production version app.


So here is the basic monthly workflow, if we do the above steps:

1) Run script to populate current database

2) Run scripts to generate the text file

3)* Run scripts to store the same data (processed in step2) and
populate the 2nd db.

The web interface accesses the data in the 2nd database and presents
the data to the user.



What do you guys think of this?

-- 
Regards,
Vipin Nair
National Institute of Technology, Calicut
http://swvist.github.com



More information about the Teammetrics-discuss mailing list