[UPHPU] Web site scraping

Richard K Miller richardkmiller at gmail.com
Thu Sep 25 14:43:38 MDT 2008


Oops, I just noticed that the link Alvaro sent refers to the same
SimpleTest (not SimpleUnit) framework that I mentioned. Well, not
exactly, but it uses the same base code. The owner of lastcraft.com is
the creator of SimpleTest. My bad.

Richard





On Sep 25, 2008, at 2:40 PM, Richard K Miller wrote:

> In the past I've used regular expressions, but after hearing Alvaro  
> mention tidy+xpath at a UPHPU meeting, I started using that. I've  
> loved it. SimpleXML is easy to use. I haven't ventured into XSLT,  
> like Ray suggested, but tidy+xpath has been great.
>
> On a similar note, I've been looking at SimpleUnit's Web Testing  
> module and it seems pretty powerful. You can use it for far more  
> than unit testing. It's like a scriptable browser, in which you can  
> "click" links, fill out forms, work with cookies, etc. The example  
> on the website shows how to perform an automated Google search:
>
> http://www.simpletest.org/en/start-testing.html#web
>
> Richard
>
>
>
> On Sep 25, 2008, at 9:44 AM, Alvaro Carrasco wrote:
>
>> I forgot one thing: Scriptable Browser.
>> http://www.lastcraft.com/browser_documentation.php
>>
>> This makes it really easy to deal with forms, authentication,
>> clicking on links, etc.
>>
>> Seriously, the combination of scriptable browser, tidy, and xpath
>> makes scraping a piece of cake.
>>
>> Alvaro
>>
>> Alvaro Carrasco wrote:
>>> In my experience, the easiest way is: run website through tidy,
>>> load it into a DOMDocument, and use xpath.
>>>
>>> The xpath patterns are SO much easier to read and write than regex
>>> and more resistant to changes to the website (if you write them
>>> correctly). You can also use regex within xpath if you ever need it.
>>>
>>>

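For anyone reading this thread later, here is a minimal sketch of the tidy + DOMDocument + xpath approach Alvaro describes, including one way to get regex inside an xpath query. It is not code from anyone in the thread. The sample markup, the load_html() helper, and the variable names are all made up for illustration. The sketch assumes the DOM extension is loaded and only uses tidy if that extension happens to be installed. DOMXPath speaks XPath 1.0, which has no regex functions of its own, so the regex part goes through registerPhpFunctions() and preg_match.

```php
<?php
// Hypothetical input: a real scraper would fetch this over HTTP.
// The <li> tags are left unclosed on purpose to mimic sloppy markup.
$html = '<ul>'
      . '<li><a href="/talks/2008-09.pdf">September slides</a>'
      . '<li><a href="/about">About</a>'
      . '</ul>';

// Hypothetical helper: repair the markup with tidy if the extension is
// available, then load the result into a DOMDocument.
function load_html(string $html): DOMDocument
{
    if (function_exists('tidy_repair_string')) {
        $clean = tidy_repair_string($html, ['output-xhtml' => true], 'utf8');
        if ($clean !== false) {
            $html = $clean;
        }
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);  // collect parser warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();
    return $doc;
}

$doc   = load_html($html);
$xpath = new DOMXPath($doc);

// Plain xpath: the href of every link that sits directly in a list item.
$hrefs = [];
foreach ($xpath->query('//li/a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

// Regex within xpath: expose preg_match to the query engine, then keep
// only the links whose href ends in ".pdf".
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('preg_match');
$pdfs  = [];
$query = '//a[php:functionString("preg_match", "/\.pdf$/", @href) > 0]';
foreach ($xpath->query($query) as $a) {
    $pdfs[] = $a->getAttribute('href');
}

echo implode("\n", $hrefs), "\n";
echo implode("\n", $pdfs), "\n";
```

The xpath expressions describe where the data lives in the document tree rather than what the surrounding text looks like, which is the reason they tend to survive small changes to the page better than a regex over the raw HTML.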