[Freedombox-discuss] is a distributed search engine (e.g. YaCy) to be part of the FB package?

Sat Jul 16 20:45:34 UTC 2011

> using DDG or Scroogle - isn't Leaving the Cloud

> Not missing this, because a lot of us are using it

For sure, and even the Seeks Project isn't Leaving the Cloud.
The sad thing is that even the Seeks Project PM says that
distributed search (crawling/indexing) isn't possible here and now,
that it's complicated, etc.
I know many people who don't mention Search because they have
simply given up on the idea of "something" that could displace
Google and the others.
-
What I know about Search: even crawling everything is doable. There is
a ton of scientific material on distributed crawling, and the search
giants do distributed crawling themselves. Where it could be said that
this is not for client machines (I believe it surely is), some flavor
of FBox could perform the needed tasks: maybe not the Plug, as it could
be weak on performance, but some other type of server, like those small
multimedia PC centers.

I found a link where some people crawled a third of the Egyptian
internet in a week or two. That's Not So Bad! It may not be as fresh
as Google's, but there are tons of algorithms for cleverly renewing
the results, re-using what is already there, etc. Modern research on
p2p crawling and distributed indexing has been around for a decade or
more.
It clearly shows that the scientific/academic world is constantly
excited by this idea. Given existing initiatives like, let's say,
grub.org (non-Java), which keep evolving and reinventing themselves
for that new Search goal, I think an infrastructure based on FBoxes
could exist, and people who could provide resources are already
working.
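
As a rough illustration of how crawling work could be split among many
small boxes, here is a minimal sketch in Python. It assumes one common
approach from the distributed-crawling literature: each host name is
hashed and assigned to exactly one peer, so no two boxes crawl the same
site. The peer names and the functions assign_peer and partition are
made up for the example; this is not a description of how YaCy,
grub.org or Seeks actually work.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical peer names; in practice these would be FBox node IDs.
PEERS = ["fbox-a", "fbox-b", "fbox-c"]

def assign_peer(url, peers=PEERS):
    """Pick the peer responsible for a URL by hashing its host name."""
    host = urlparse(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and reduce modulo the peer count.
    index = int.from_bytes(digest[:8], "big") % len(peers)
    return peers[index]

def partition(urls, peers=PEERS):
    """Group URLs into per-peer work queues."""
    queues = {peer: [] for peer in peers}
    for url in urls:
        queues[assign_peer(url, peers)].append(url)
    return queues

queues = partition([
    "http://example.org/a",
    "http://example.org/b",
    "http://example.net/",
])
```

All pages of one host land on the same box, so rate limiting per site
stays a local decision. A plain modulo reshuffles most hosts when a
peer joins or leaves; consistent hashing would be the usual refinement.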

I won't say anything unrealistic to this list out of affection for the
Idea, believe me. I've done the needed research, used many flavors of
meta-search, and contributed to the FBox wiki on this theme. I'm part
of a team that is developing Search for our OT (concurrent editing)
and XMPP social networks, based on insider Robots and Pushing, so that
it doesn't look so dumb compared to today's crawling practice.

What I can say for sure: constant polling (crawling) is a bad method
and unsuited for next-gen Real-Time search, even for Corporations like
Google with all their resources. And believe me, Google+ won't use
crawling for their network. There are articles describing that their
earlier Wave, and many of their other initiatives, were built around
the idea of Real-Time, based on Pushing the result.
Now is a very interesting time, when not-so-big start-ups like
Superfeedr.com or Collecta.com could eventually compete with Google
etc. on real-time search, and so could we.
The thing that does suit it is Push-search, as opposed to
Pull-crawling. Those techniques, if massively used, could introduce a
whole New World where Search companies would compete on UI/UX and
brave ideas, not on closed DBs.
Whether it is PubSubHubbub, XEP-0060 PubSub, or some other popular
technique, it is possible to compete there with the Corporations. As
for P2P crawling: even starting with, let's say, a month's gap in
collected data compared to the structure the Corporations have built,
P2P or other user-based distributed Crawling (with old Polling) could
eventually, finally, replace what Google etc. are doing in this area,
and that could be proven by facts.
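
To make the Push-versus-Pull contrast concrete, here is a minimal
in-memory sketch in Python. It is not PubSubHubbub or XEP-0060
themselves, only the shared idea behind them: a publisher notifies a
hub once per change and the hub pushes that change to subscribers,
while a poller has to ask every source on every round. The names Hub,
SearchIndex and poll_all are invented for the example.

```python
from collections import defaultdict

class SearchIndex:
    """A toy index: maps topic -> latest content."""
    def __init__(self):
        self.docs = {}
        self.messages_received = 0

    def on_update(self, topic, content):
        self.messages_received += 1
        self.docs[topic] = content

class Hub:
    """Toy pub/sub hub: pushes each published update to subscribers."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, content):
        for callback in self.subscribers[topic]:
            callback(topic, content)

def poll_all(sources, index, rounds):
    """Pull model: fetch every source on every round, changed or not."""
    requests = 0
    for _ in range(rounds):
        for topic, content in sources.items():
            requests += 1
            index.docs[topic] = content
    return requests

sources = {"blog-a": "v1", "blog-b": "v1", "blog-c": "v1"}

# Push: subscribe once, then receive one message per actual change.
hub, pushed = Hub(), SearchIndex()
for topic in sources:
    hub.subscribe(topic, pushed.on_update)

sources["blog-b"] = "v2"        # a single source changes
hub.publish("blog-b", "v2")     # the publisher pushes that change

# Pull: poll all 3 sources for 10 rounds to pick up the same change.
pulled = SearchIndex()
poll_requests = poll_all(sources, pulled, rounds=10)

print(pushed.messages_received, poll_requests)
```

Both indexes end up knowing that blog-b is at "v2", but the push side
handled one message while the pull side made thirty requests, most of
them for content that had not changed.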

The Seeks Project, "the king" of all the meta-search ideas I've seen,
has a long-term goal of eventually becoming a crawling plug-in for
existing servers, but it looks like they don't believe in this
themselves. Maybe FBox could inspire them; maybe FBox could just use
Push for its networks. Whatever the case, the FreedomBox Foundation,
given its stated ideology, is responsible for Leaving the Cloud too.

-

> but more importantly, we will need to debate a topic that has been
> bounced around in other similar projects: a search engine that will
> index not only our user-published privacy/anonymity-aware sites and
> material, but also tor/i2p/freenet sites. The reason why I bring this
> up is that there was a big debate in some of the larger
> anonymity/privacy-concerned projects, including TOR, about having an
> internal search indexing service, and the general consensus was that
> the ability for sites to not be found on a search engine is a benefit,
> not a flaw. It is not just the idea, but the request: some sites,
> their users, and the content providers of these sites don't want to be
> found, don't want to be indexed, and don't want to be identified.

In our Wave-like networks, which are inspired by the Robots/Gadgets
scheme from Google Wave, the idea is that you, as an admin of a Wave
or as a trusted participant, can add the "Crawly" Robot to the
discussion, and it will crawl and collect what is pushed to it only if
you wish so.
What you're talking about could be a matter of robots.txt.
I should say that robots.txt and http://Schema.org, which modern
Search Corporations want from you, are also constantly changing and
are a theme for debate; they could be reinvented/improved.
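
For the robots.txt side of this, a small sketch with Python's standard
urllib.robotparser shows how a site can opt out of one crawler while
staying visible to others. The robot name "Crawly" is taken from the
paragraph above; the rules and the example.org URLs are made up for
illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: "Crawly" is banned entirely,
# every other robot is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: Crawly
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, url):
    """Return True if the rules above let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)

crawly_home = allowed("Crawly", "http://example.org/index.html")
other_home = allowed("OtherBot", "http://example.org/index.html")
other_private = allowed("OtherBot", "http://example.org/private/notes.html")
```

Keep in mind that robots.txt is only a request: it works because
crawlers agree to honor it, so a site that truly must not be found
cannot rely on it alone.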

What I really need to remind readers: Google itself has been on the
real market for less than 13 years.
In my opinion, 13 years based on Good Promises and Not So Adequate
Realization (from what you can read in
http://ilpubs.stanford.edu:8090/361/ and
http://infolab.stanford.edu/~backrub/google.html)
doesn't give any Astonishing Advantage over what the FLOSS community
and the scientific world are capable of now.