[UPHPU] Web site scraping

thebigdog bigdog at venticon.com
Thu Sep 25 10:21:31 MDT 2008


Nathan Lane wrote:
> I want to make what in effect is a website scraper using PHP, but it isn't
> obvious how this would best be done. I've tried using DOMDocument and I'm
> not sure if that's the best option or not. I'd really like to use something
> where I could use XPath to get the elements out that I want. Recently I
> wrote a similar program in C# that I call HttpAnalyzer. Could I just use
> that with PHP (i.e. call it from PHP) to get what I'm looking for? Any
> suggestions?

I would agree with Alvaro and Walt. You could actually combine the two
suggestions. I have done the following:

1. Download the page
2. Run the page through tidy (to clean up the tags)
3. Apply an XSLT transform with DOM
4. Retrieve the results
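
The steps above can be sketched roughly as follows. This is only a minimal
illustration, not the code I actually run: the sample markup, the stylesheet
and the <links>/<link> output format are made up for the example, and it
assumes the dom and xsl extensions are loaded (tidy is used only if present).

```php
<?php
// Hypothetical sample markup with unclosed <li> tags. In practice this would
// be the downloaded page (step 1), e.g. $html = file_get_contents($url);
$html = '<html><body><ul>'
      . '<li><a href="/a">First</a>'
      . '<li><a href="/b">Second</a>'
      . '</ul></body></html>';

// Step 2: clean up the tags with tidy when the extension is available.
if (function_exists('tidy_repair_string')) {
    $repaired = tidy_repair_string($html, ['wrap' => 0], 'utf8');
    if (is_string($repaired) && $repaired !== '') {
        $html = $repaired;
    }
}

// Load the markup into DOM. Parser warnings are collected, not printed.
libxml_use_internal_errors(true);
$page = new DOMDocument();
$page->loadHTML($html);
libxml_clear_errors();

// Step 3: an illustrative stylesheet that reduces the page to a flat list
// of links. The output element names are assumptions for this sketch.
$xslSource = <<<'XSL'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no"/>
  <xsl:template match="/">
    <links>
      <xsl:for-each select="//a[@href]">
        <link href="{@href}"><xsl:value-of select="normalize-space(.)"/></link>
      </xsl:for-each>
    </links>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xsl = new DOMDocument();
$xsl->loadXML($xslSource);
$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
$result = $proc->transformToDoc($page);

// Step 4: retrieve the results from the transformed document with XPath.
$links = [];
$xpath = new DOMXPath($result);
foreach ($xpath->query('/links/link') as $node) {
    $links[$node->getAttribute('href')] = $node->textContent;
}
print_r($links);
```

Keeping the extraction rules in the stylesheet means the PHP side stays the
same from site to site; only the XSLT has to change.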

This has worked really well for me, both in terms of speed and with the amount
of data I have run through it. XSLT can contain logic, which is really nice. By
using XSLT I can create various transformations, which gives greater flexibility
and customization, and I can still use all the XML technologies like XPath.
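
As an illustration of what "logic" in the stylesheet can look like, here is a
small sketch using xsl:choose and a stylesheet parameter. The input document,
the "sep" parameter and the internal/external rule are all invented for the
example, and it again assumes the dom and xsl extensions are loaded.

```php
<?php
// Already-clean sample input (hypothetical), standing in for a tidied page.
$doc = new DOMDocument();
$doc->loadXML(
    '<html><body>'
    . '<a href="http://example.com/x">Out</a>'
    . '<a href="/local">In</a>'
    . '</body></html>'
);

// The stylesheet decides per link whether it is internal or external, and
// takes the field separator as a parameter so callers can customize output.
$xslSource = <<<'XSL'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="sep" select="'|'"/>
  <xsl:template match="/">
    <xsl:for-each select="//a[@href]">
      <xsl:choose>
        <xsl:when test="starts-with(@href, 'http')">external</xsl:when>
        <xsl:otherwise>internal</xsl:otherwise>
      </xsl:choose>
      <xsl:value-of select="concat($sep, @href)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xsl = new DOMDocument();
$xsl->loadXML($xslSource);

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
$proc->setParameter('', 'sep', ','); // override the default separator
$report = $proc->transformToXml($doc);

echo $report;
```

Swapping in a different stylesheet, or just a different parameter value,
changes the output without touching the PHP code.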


-- 
thebigdog

