"Improve your... Google?"

Feb 11 2010

While developing PVS-Studio, a code analyzer that searches for issues in 64-bit and concurrent software, we found that we needed to collect fresh information from the Internet on certain topics. For example, it is always useful to answer questions on forums and blogs from programmers who may be interested in our tool. There turned out to be so much information on the Internet that searching manually would be long and tiresome. So we set out to automate the search for fresh data. In this post we will tell you how we do it.

You have probably just said: "Ha-ha! These guys are reinventing the wheel and are not aware of Google Alerts". Well, we are aware of Google Alerts. It is almost what we need, but not quite :-). We used Google Alerts for more than half a year and did not manage to get what we wanted. Here is what we need:

  • search on a specific list of sites;
  • search limited to the last twenty-four hours;
  • the ability to add stop words;
  • results without the additional filtering that Google Alerts applies. In other words, the usual Google search is more helpful than Google Alerts.

That is why we decided to reinvent the wheel.

For this task we need to find new materials on particular sites: up to 30 results, each created no earlier than 24 hours before the automated search is launched. Roughly speaking, we need to know what people have written on the Internet over the last day. The input data are as follows:

  • The list of site addresses - the URLs of the sites to search.
  • The list of key phrases - the phrases in Russian and/or English to search for.
  • The list of undesirable words - the words that must be excluded from the search results.

The idea

Many services on the Internet offer search, and it is reasonable to use their capabilities for our task. We chose google.com because, in our opinion, it is the most suitable one.

Search in Google

Google works on the same principle as any other search engine: you send it a request and it returns an answer. The search engine has flexible settings that make it easier to form the request you need.

Search parameters in Google

Let us look at the search parameters most relevant to our task:

  • http://www.google.com/search? - the address itself;
  • as_q - the key phrase (the whole phrase, not a list of separate words);
  • num - the number of results to be shown on the page;
  • as_eq - the words that must be excluded from the search results;
  • as_sitesearch - the URL of the site to search.

The search engine has other parameters as well, but they are irrelevant in our case. Here is an example of a Google request with search parameters:

http://www.google.com/search?as_q=64-bit+portability+&
hl=ru&newwindow=1&
num=30&btnG=%D0%9F%D0%BE%D0%B8%D1%81%D0%BA+%D0%B2+Google&
as_epq=&as_oq=&
as_eq=%D0%BA%D1%83%D0%BF%D0%B8%D1%82%D1%8C+%D1%81%D0%BA%D0%B0%D1%87%D0%B0%D1%82%D1%8C+&
lr=lang_ru&cr=&as_ft=i&as_filetype=&
as_qdr=d&as_occt=any&
as_dt=i&as_sitesearch=http://www.codeguru.com/&
as_rights=&safe=images
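
A request like the one above does not have to be assembled by hand. The following is a minimal sketch, not the code of our script: the function name build_search_url and its arguments are hypothetical, and the as_qdr=d parameter is simply copied from the example request on the assumption that it limits the results to the last day.

```php
<?php
// Hypothetical helper: builds a Google search URL from the parameters
// described above. The name and the argument order are assumptions.
function build_search_url(string $phrase, string $site, array $stopWords, int $num = 30): string
{
    $params = [
        'as_q'          => $phrase,                  // the key phrase
        'num'           => $num,                     // number of results on the page
        'as_eq'         => implode(' ', $stopWords), // words to exclude
        'as_sitesearch' => $site,                    // the site to search
        'as_qdr'        => 'd',                      // copied from the example request;
                                                     // assumed to mean "the last day"
    ];
    // http_build_query URL-encodes every value, so non-ASCII stop words
    // and spaces are handled for us.
    return 'http://www.google.com/search?' . http_build_query($params);
}

echo build_search_url('64-bit portability', 'http://www.codeguru.com/', ['buy', 'download']), PHP_EOL;
```

Empty parameters from the example request (as_epq, as_oq and so on) are simply left out here.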

How we can use it

It follows from the above that we can automate the search process using Google search. The algorithm is as follows:

  • We form a request to Google from the input data.
  • The request is sent to Google and processed.
  • The result is processed (the HTML page is parsed).
  • The previous steps are repeated for every site and every key phrase in the input data.
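
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: collect_links and the $fetch callable are hypothetical names, the page is parsed with PHP's built-in DOMDocument rather than with the parser used in our script, and a result is taken to be any link whose address starts with the site URL, which says nothing about the real markup of a Google results page.

```php
<?php
// Sketch of the algorithm. $fetch is a hypothetical callable that takes a URL
// and returns the HTML of the results page; it is passed in so that the
// example can run without network access.
function collect_links(array $sites, array $phrases, array $stopWords, callable $fetch): array
{
    $found = [];
    foreach ($sites as $site) {
        foreach ($phrases as $phrase) {
            // Step 1: form the request from the input data.
            $url = 'http://www.google.com/search?' . http_build_query([
                'as_q'          => $phrase,
                'as_eq'         => implode(' ', $stopWords),
                'as_sitesearch' => $site,
                'num'           => 30,
            ]);
            // Step 2: the request is processed.
            $html = $fetch($url);
            // Step 3: parse the HTML page and keep the links that point to the site.
            $dom = new DOMDocument();
            @$dom->loadHTML($html); // suppress warnings about imperfect markup
            foreach ($dom->getElementsByTagName('a') as $a) {
                $href = $a->getAttribute('href');
                if (strpos($href, $site) === 0) {
                    $found[$href] = true; // array keys remove duplicates
                }
            }
        }
    }
    return array_keys($found);
}

// A stub instead of a real HTTP request; the markup is invented for the demo.
$fakeFetch = function (string $url): string {
    return '<html><body><a href="http://forum.sources.ru/topic1">Topic</a>'
         . '<a href="http://www.google.com/help">Help</a></body></html>';
};

print_r(collect_links(['http://forum.sources.ru/'], ['viva64', '64-bit migration'], ['buy'], $fakeFetch));
```

Passing the fetching function as an argument keeps the loop itself independent of how the page is actually downloaded.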

Implementation

The script is written in PHP.

Input data

There are three kinds of input data: the list of URLs of the sites to search, the list of key phrases, and the list of words that must be excluded from the search results. We store these data in an XML file like this:

<?xml version="1.0" encoding="utf-8"?>
<search_params lang="ru">
        <sites>
                <url>http://www.dreamincode.net</url>
                <url>http://forum.vingrad.ru/</url>
                <url>http://forum.sources.ru/</url>
                <url>http://groups.google.com/</url>
        </sites>
        <words>
                <white_list>
                        <phrase>"64-bit" c++</phrase>
                        <phrase>64-bit migration</phrase>
                        <phrase>viva64</phrase>
                </white_list>
                <black_list>
                        <phrase>buy</phrase>
                        <phrase>download</phrase>
                </black_list>
        </words>
</search_params>
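
As an illustration, a file of this shape can be read into plain PHP arrays. The sketch below uses PHP's built-in SimpleXML extension instead of the parser mentioned in the next section, and load_search_params is a hypothetical name, so treat it as one possible way to do it rather than as the code of our script.

```php
<?php
// Sketch: read the input data with the built-in SimpleXML extension.
// load_search_params() is a hypothetical helper name.
function load_search_params(string $xml): array
{
    $root = simplexml_load_string($xml);
    if ($root === false) {
        throw new InvalidArgumentException('Malformed XML input');
    }
    // Converts a set of sibling elements into an array of trimmed strings.
    $toStrings = function ($nodes): array {
        $out = [];
        foreach ($nodes as $node) {
            $out[] = trim((string) $node);
        }
        return $out;
    };
    return [
        'lang'       => (string) $root['lang'],
        'sites'      => $toStrings($root->sites->url),
        'white_list' => $toStrings($root->words->white_list->phrase),
        'black_list' => $toStrings($root->words->black_list->phrase),
    ];
}

// A shortened copy of the file above, embedded to keep the example self-contained.
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<search_params lang="ru">
  <sites><url>http://www.dreamincode.net</url></sites>
  <words>
    <white_list><phrase>"64-bit" c++</phrase><phrase>viva64</phrase></white_list>
    <black_list><phrase>buy</phrase></black_list>
  </words>
</search_params>
XML;

print_r(load_search_params($xml));
```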

XML parsing

The XML file has a simple structure and a small size, so you may parse it with the PHP Simple HTML DOM Parser script.

The documentation describes how to use the script, but we should note that working with the DOM through it is very similar to working with jQuery, a popular JavaScript library. For example, the following code gets all the links from the HTML page at google.com and prints them on the screen:

include('../simple_html_dom.php');
// get the DOM from a URL or a file
$html = file_get_html('http://www.google.com/');
// find all links and print their addresses
foreach ($html->find('a') as $e) {
    echo $e->href . '<br>';
}

However, Simple HTML DOM Parser has a memory-related issue. The function file_get_html creates a new object of the simple_html_dom class at every call, so if you call it in a loop, memory runs out. For a reason we could not determine, we were unable to force that memory to be freed. So the most reasonable solution is to avoid calling this function in a loop: call it once and work with a single object of the simple_html_dom class.

Script creation

Actually, there is nothing remarkable about it: it is an ordinary PHP script written using the MVC pattern. The source code is simple as well.

The user interface is very simple: when you open the page in the browser, you see a single button, "Send Request", and a couple of seconds after you click it the result appears on the screen.

Conclusion

With this script in place, we can always see what has happened in our field (64-bit and parallel programming) over the last twenty-four hours.
