Thursday, January 9, 2014

Seek, and You Shall--Aggregate the Results



I search the Internet (who doesn't in the 21st century?--rhetorical question) using many search engines to find information. An Internet search is not "one size fits all": no single engine finds everything. But it is tedious and tiresome to repeat a search across a variety of search engines, or even across engines that search by way of other engines. My goal was to search more efficiently--do more with less.

Project



This project has the goal of more efficient search: searching through other search engines, but from a single point--a locus or nexus. The operation is to query many search engines, then accumulate and aggregate the results. Thus Search Engine QUery-Results AccuMulator Aggregator, or SEQu-RAmA--"Seek-you-rama." Remember the "Futurama" auto exhibit from one of the Big Three automakers, or the cartoon with Philip J. Fry?

The search engine itself is a simple web server written in Java (so it runs on the different computers I have).

The web server prompts for input with an HTML5 web page, using JavaScript for some simple checks (no empty search) and conveniences (clearing the input).


To use the search engine, I configured each browser to open "http://localhost:PORT" by default when it starts. The web server uses the identity of the web browser when it connects to the search engines--in effect, the web server is a proxy for the browser, an extension of it in the form of a web server.
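The post doesn't show the server code itself, but a minimal sketch is possible with the JDK's built-in com.sun.net.httpserver package: bind to localhost, serve the search form at the root path. The class name, port number, and page contents here are all assumptions of the sketch.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

class SequramaServer {
    // A bare-bones stand-in for the HTML5 search page
    static final String FORM =
        "<!DOCTYPE html><html><body>"
      + "<form action='/search' method='get'>"
      + "<input name='q'> <input type='submit' value='Search'>"
      + "</form></body></html>";

    // Bind to localhost only, serve the form at "/", and return the
    // server so the caller can stop it later.
    static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
        server.createContext("/", exchange -> {
            byte[] body = FORM.getBytes("UTF-8");
            exchange.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        start(8080);   // then browse to http://localhost:8080
    }
}
```

Binding to localhost (rather than all interfaces) keeps the server private to the machine, which fits the single-user design.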

Implementation



Each search engine has its particular idiosyncrasies, hence there is an abstract class that is sub-classed for each particular search engine. Each search engine has its own sub-class, which also implements threading using Runnable. Search results are returned in a "public String[] results" field, along with a "public boolean readyFlag = false" that is set when the results are ready.
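The base class described above might look like the following sketch. The field names (results, readyFlag) come from the post; the class name, constructor, and search() hook are assumptions.

```java
// Abstract base for one search engine; each engine gets its own sub-class.
abstract class SearchEngine implements Runnable {

    // Raw result links from this engine, filled in by run()
    public String[] results = new String[0];

    // volatile so the polling thread is guaranteed to see the update
    public volatile boolean readyFlag = false;

    protected final String query;

    protected SearchEngine(String query) {
        this.query = query;
    }

    // Engine-specific fetch-and-parse logic goes in each sub-class
    protected abstract String[] search(String query) throws Exception;

    @Override
    public void run() {
        try {
            results = search(query);
        } catch (Exception e) {
            results = new String[0];   // 0-result default on any error
        }
        readyFlag = true;              // set last, after results are stored
    }
}
```

Keeping the "store results, then set the flag" order inside run() is what makes the later polling safe.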

Once the search parameters are submitted, the web server creates a series of threads to search in parallel, held in an array "SearchEngine[] engine". Synchronization is unnecessary, as the web server simply polls each engine: "engine[3].readyFlag" being true indicates that search is complete. Each search always stores its results in "String[] results" first and only then sets the flag. Hence if the web server reads readyFlag as false while the thread is updating it to true, there is no problem; the web server will simply check again on the next poll. (Declaring readyFlag volatile guarantees the polling thread actually sees both the flag update and the results written before it.)
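The polling described above can be sketched as follows. StubEngine stands in for the real per-engine sub-classes (an assumption of this sketch), and the 50 ms pause between polls is an arbitrary choice.

```java
// Minimal stand-in for a per-engine sub-class: stores results, then flags.
class StubEngine implements Runnable {
    public volatile boolean readyFlag = false;
    public String[] results = new String[0];

    public void run() {
        results = new String[] { "http://example.com/a.html" };
        readyFlag = true;              // set only after results are stored
    }
}

class Poller {
    // One thread per engine, then repeatedly scan every readyFlag
    // until all engines have reported.
    static void waitForAll(StubEngine[] engine) {
        for (StubEngine se : engine) {
            new Thread(se).start();
        }
        boolean allDone = false;
        while (!allDone) {
            allDone = true;
            for (StubEngine se : engine) {
                if (!se.readyFlag) { allDone = false; break; }
            }
            if (!allDone) {
                try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            }
        }
    }
}
```

Busy-wait polling is simple and adequate here because the number of engines is small and the searches themselves dominate the wait time.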

Results



Each search engine defaults to returning a 0-length results array and setting readyFlag to true if a connection times out or some other error occurs; a log file is kept for each web search.

When "engine[3].readyFlag" is true, the web server reads "engine[3].results" with a loop from 0 to "engine[3].results.length", inserting each result into an index structure that is a set--so there are no duplicates.

Once all the searches are complete and the results are in the index structure, the results are processed by type of result--.html, .xml, .pdf, .jpeg, .asp, et cetera. The extension-extraction algorithm extracts the extension, then inserts the result into an ordered map of sets: each extension has a set containing all results with that extension. There are no predefined extensions; a set is created dynamically the first time an extension appears. This has the quirk that .htm and .html get their own separate sets, and sometimes an extension has only a single result.
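The ordered map of sets described above maps naturally onto a TreeMap of TreeSets. This is a sketch: the class name, the extension-extraction rule, and the "(none)" catch-all bucket for links without an extension are all assumptions.

```java
import java.util.*;

class ExtensionIndex {
    // Ordered map of extension -> ordered set of links; buckets are
    // created on demand, so ".htm" and ".html" get separate sets.
    private final TreeMap<String, TreeSet<String>> byExt = new TreeMap<>();

    // Extract the extension: the text after the last dot, provided that
    // dot comes after the last slash; otherwise use a catch-all bucket.
    static String extensionOf(String link) {
        int slash = link.lastIndexOf('/');
        int dot = link.lastIndexOf('.');
        return (dot > slash) ? link.substring(dot) : "(none)";
    }

    void insert(String link) {
        // computeIfAbsent creates the set dynamically for a new extension
        byExt.computeIfAbsent(extensionOf(link), k -> new TreeSet<>()).add(link);
    }

    SortedMap<String, TreeSet<String>> view() { return byExt; }
}
```

Because each bucket is a set, inserting the same link from two different engines stores it only once.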

After all links are stored in the ordered map of sets, the results are presented in a predefined HTML5 format, with embedded links that allow quickly jumping to results by type. The actual links are presented in a table, with a sub-table for each type. The output format is functional and navigable, but lacks the previews and other frills of the main search engines.
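A sketch of that output stage might walk the ordered map and emit one anchor per extension followed by a table of links. The real page layout isn't shown in the post, so the structure below (jump links, one table row per extension heading) is an assumption.

```java
import java.util.*;

class HtmlRenderer {
    // Render extension -> links as a table with per-type jump anchors.
    static String render(SortedMap<String, ? extends Set<String>> byExt) {
        StringBuilder sb = new StringBuilder("<!DOCTYPE html><html><body>");
        // Jump links, one per extension, pointing at the row anchors below
        for (String ext : byExt.keySet()) {
            sb.append("<a href='#").append(ext).append("'>").append(ext).append("</a> ");
        }
        sb.append("<table>");
        for (Map.Entry<String, ? extends Set<String>> e : byExt.entrySet()) {
            sb.append("<tr id='").append(e.getKey()).append("'><th>")
              .append(e.getKey()).append("</th></tr>");
            for (String link : e.getValue()) {
                sb.append("<tr><td><a href='").append(link).append("'>")
                  .append(link).append("</a></td></tr>");
            }
        }
        sb.append("</table></body></html>");
        return sb.toString();
    }
}
```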

Kickin' Off



I have SEQu-RAmA on my Mac, Linux, and Windows computers. I've created a "kicker" program--a simple application that starts the web server, delays for 3 seconds, and then launches the web browser. The fun (or hard) part was creating a native kicker using g++ on my Mac, gcc on my Linux box, and C# on the Windows machine. The desktop icon is pretty basic, but it works: the browser automatically connects to the web server. Click the icon, connect to the web server, and search in parallel for data.

Future SEQu-RAmA



Of course this is the initial prototype, so improvements, tweaks, and alterations will be made over time. For example, configurable exclusion, so that types of results like images, videos, and sound can be filtered out.

Another possibility is to use the many result sets to determine the significance of links across the various search engines--a weight for each result. A result repeated across engines increases in weight (returned by M out of N search engines), and the results of each type are then sorted accordingly.
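That weighting idea amounts to counting how many engines returned each link. A minimal sketch, assuming each engine's results arrive as a String array (the method and class names are invented for illustration):

```java
import java.util.*;

class ResultWeights {
    // A link returned by M of the N engines gets weight M.
    // engineResults[i] holds engine i's raw result list.
    static Map<String, Integer> weigh(String[][] engineResults) {
        Map<String, Integer> weight = new HashMap<>();
        for (String[] results : engineResults) {
            // Each engine contributes at most once per link, even if its
            // own list happens to repeat a link.
            for (String link : new HashSet<>(Arrays.asList(results))) {
                weight.merge(link, 1, Integer::sum);
            }
        }
        return weight;
    }
}
```

Sorting each extension's set by descending weight would then surface the links that multiple engines agree on.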

Variations on input are possible, such as searching on each parameter individually and then on permutations of the parameters. Consider a search for "Richard Nixon Watergate": also search "Richard", "Nixon", "Watergate", then "Richard Nixon", "Richard Watergate", and so on--and then determine, across all the search engines, the weight of each result.
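One way to generate those variants is to take every non-empty, order-preserving subset of the query terms. The post leaves the exact combination scheme open, so this bitmask approach is just one plausible sketch:

```java
import java.util.*;

class QueryVariants {
    // Every non-empty subset of the terms, preserving word order:
    // {"Richard","Nixon","Watergate"} yields "Richard", "Nixon",
    // "Richard Nixon", "Richard Watergate", ... (7 variants for 3 terms).
    static List<String> variants(String[] terms) {
        List<String> out = new ArrayList<>();
        for (int mask = 1; mask < (1 << terms.length); mask++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < terms.length; i++) {
                if ((mask & (1 << i)) != 0) {
                    if (sb.length() > 0) sb.append(' ');
                    sb.append(terms[i]);
                }
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

Note the count grows as 2^n - 1, so in practice the number of terms expanded this way would need a cap.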

One wild possibility is a "deep search," in which the initial results for an extension seed a deeper search: extract data from the results (not possible for images, sounds, or videos) and revise the query. Such a deeper search would require an existing query, with a button or link to trigger it, and a wait for results that requires actually fetching the web data.

Nothing New



An acquaintance sneered at the SEQu-RAmA project, saying that "XYZ search engine works for me." I simply pointed out that if they're that happy, why be so disdainful--and that I use "XYZ search engine" as one of the engines in the multiple parallel search. I recently read a Slate article by Jessica Olien, "Inside the Box: People Don't Actually Like Creativity," about how creativity and innovation are supposedly prized but actually despised. No big surprise to me: asking "why" with the implication of wasted effort and time illustrates a lack of imagination, a dumbing down of thought.

But I will continue tweaking, improving, and revising SEQu-RAmA for my own use. It might be possible as a tablet app, or even a browser plug-in, although I prefer a separate web server as a proxy for the web browser. Seek, and you shall accumulate, aggregate--and then find.