[UPHPU] Web site scraping

Richard K Miller richardkmiller at gmail.com
Thu Sep 25 14:40:52 MDT 2008


In the past I've used regular expressions, but after hearing Alvaro  
mention tidy+xpath at a UPHPU meeting, I started using that. I've  
loved it. SimpleXML is easy to use. I haven't ventured into XSLT, like  
Ray suggested, but tidy+xpath has been great.
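
For anyone who hasn't tried it, here is a minimal sketch of the tidy + DOMDocument + xpath approach. The sample markup, the "result" class, and the variable names are made up for illustration; a real scraper would fetch the page first. The tidy step is skipped when the tidy extension isn't loaded, in which case DOMDocument's own lenient HTML parser does the cleanup.

```php
<?php
// Hypothetical markup standing in for a fetched page (note the unclosed <div>).
$html = '<html><body>'
      . '<div class="result"><a href="/a">First</a></div>'
      . '<div class="result"><a href="/b">Second</a>'
      . '</body></html>';

// Step 1: repair the markup with tidy, if the extension is available.
if (function_exists('tidy_repair_string')) {
    $repaired = tidy_repair_string($html, ['output-xhtml' => true], 'utf8');
    if ($repaired !== false) {
        $html = $repaired;
    }
}

// Step 2: load the markup into a DOMDocument, silencing parser warnings.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Step 3: pull out what we want with xpath.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//div[@class="result"]/a') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}

print_r($links);
```

If the site later wraps those links in extra markup, only the one xpath expression needs to change.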

On a similar note, I've been looking at SimpleTest's web testing
module and it seems pretty powerful. You can use it for far more than
unit testing. It's like a scriptable browser, in which you can "click"
links, fill out forms, work with cookies, etc. The example on the
website shows how to perform an automated Google search:

http://www.simpletest.org/en/start-testing.html#web

Richard



On Sep 25, 2008, at 9:44 AM, Alvaro Carrasco wrote:

> I forgot one thing: Scriptable Browser.
> http://www.lastcraft.com/browser_documentation.php
>
> This makes it really easy to deal with forms, authentication, clicking
> on links, etc.
>
> Seriously, the combination of scriptable browser, tidy, and xpath
> makes scraping a piece of cake.
>
> Alvaro
>
> Alvaro Carrasco wrote:
>> In my experience, the easiest way is: run website through tidy,
>> load it into a DOMDocument, and use xpath.
>>
>> The xpath patterns are SO much easier to read and write than regex
>> and more resistant to changes to the website (if you write them
>> correctly).
>> You can also use regex within xpath if you ever need it.
>>
>>
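
On the regex-within-xpath point: PHP's DOMXPath implements XPath 1.0, which has no regex function of its own, so one way to do it is to expose preg_match to the query with registerPhpFunctions(). A rough sketch, assuming made-up markup and a made-up "/item/<number>" URL pattern:

```php
<?php
// Hypothetical markup; only the /item/<number> links should match.
$html = '<html><body>'
      . '<a href="/item/42">Widget</a>'
      . '<a href="/about">About</a>'
      . '<a href="/item/7">Gadget</a>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Make preg_match callable from inside xpath expressions as php:functionString(...).
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('preg_match');

// functionString passes @href as a string; preg_match returns 1 on a match.
$query = '//a[php:functionString("preg_match", "#^/item/\d+$#", @href) = 1]';

$matches = [];
foreach ($xpath->query($query) as $a) {
    $matches[] = $a->getAttribute('href');
}

print_r($matches);
```

Registering only preg_match, rather than every PHP function, keeps the expression from calling anything else.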

