[UPHPU] Web site scraping
Richard K Miller
richardkmiller at gmail.com
Thu Sep 25 14:43:38 MDT 2008
Oops, I just noticed that the link Alvaro sent refers to the same
SimpleTest (not SimpleUnit) framework that I mentioned. Well, not
exactly, but it uses the same code base. The owner of lastcraft.com is
the creator of SimpleTest. My bad.
Richard
On Sep 25, 2008, at 2:40 PM, Richard K Miller wrote:
> In the past I've used regular expressions, but after hearing Alvaro
> mention tidy+xpath at a UPHPU meeting, I started using that. I've
> loved it. SimpleXML is easy to use. I haven't ventured into XSLT,
> like Ray suggested, but tidy+xpath has been great.
>
> On a similar note, I've been looking at SimpleUnit's Web Testing
> module and it seems pretty powerful. You can use it for far more
> than unit testing. It's like a scriptable browser, in which you can
> "click" links, fill out forms, work with cookies, etc. The example
> on the website shows how to perform an automated Google search:
>
> http://www.simpletest.org/en/start-testing.html#web
>
> Richard
>
>
>
> On Sep 25, 2008, at 9:44 AM, Alvaro Carrasco wrote:
>
>> I forgot one thing: Scriptable Browser.
>> http://www.lastcraft.com/browser_documentation.php
>>
>> This makes it really easy to deal with forms, authentication,
>> clicking on links, etc.
>>
>> Seriously, the combination of scriptable browser, tidy, and xpath
>> makes scraping a piece of cake.
>>
>> Alvaro
>>
>> Alvaro Carrasco wrote:
>>> In my experience, the easiest way is: run the website through tidy,
>>> load it into a DOMDocument, and use xpath.
>>>
>>> The xpath patterns are SO much easier to read and write than regex
>>> and more resistant to changes to the website (if you write them
>>> correctly).
>>> You can also use regex within xpath if you ever need it.
>>>
>>>
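
For anyone reading the archive, here is a rough sketch of the
tidy -> DOMDocument -> xpath recipe Alvaro describes, plus one way to
call a regex from inside an xpath query. It is only an illustration, not
code from anyone in this thread: the sample markup, the "news" id and
the link names are made up, and it assumes PHP's dom extension (tidy is
used when the tidy extension is loaded and skipped otherwise).

```php
<?php
// Hypothetical sample markup (unquoted attribute, unclosed <li> tags).
// A real scraper would fetch the page first, e.g. with file_get_contents().
$html = '<html><body><ul id=news>'
      . '<li><a href="/a.html">First</a>'
      . '<li><a href="/b.pdf">Second</a>'
      . '</ul></body></html>';

// Step 1: clean up the markup with tidy when the extension is available.
if (function_exists('tidy_repair_string')) {
    $clean = tidy_repair_string($html, ['output-xhtml' => true, 'wrap' => 0], 'utf8');
    if ($clean !== false) {
        $html = $clean;
    }
}

// Step 2: load the markup into a DOMDocument, keeping parser warnings quiet.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Step 3: pull out what you want with xpath.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//ul[@id="news"]//a[@href]') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}

// Regex within xpath: expose preg_match() to the query through the
// php: namespace and keep only links whose href ends in ".pdf".
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('preg_match');
$pdfs = [];
$query = '//a[php:functionString("preg_match", "/\.pdf$/i", @href) > 0]';
foreach ($xpath->query($query) as $a) {
    $pdfs[] = $a->getAttribute('href');
}

print_r($links);
print_r($pdfs);
```

The "> 0" comparison matters: preg_match() returns a number, and a bare
number inside an xpath predicate is treated as a position rather than as
true/false.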