I've done a bit of website "screen scraping". It can be difficult, depending on cookies, JavaScript etc., but simple sites can be parsed and traversed quite quickly.

I developed a set of tools for fetching train ticket prices, to let you break a journey down into single stages; you can often save 40% off the normal price using this technique. The tools we used were Python with urllib and urllib2. The HTML parser was BeautifulSoup, which is really easy to use:
http://www.crummy.com/software/BeautifulSoup/

Combined with the ElementTree XML library, it gives you ElementSoup:
http://effbot.org/zone/element-soup.htm

For simple sites these tools, and perhaps a bit of regex handling, will give you everything you want. But you will have to code it instead of training it.

D

On Sunday 16 November 2008 17:21:03 Chronoppolis wrote:
> Hello,
>
> This may be my very first post (I don't remember - but yay for me). I have
> not got my Linux projects to the point where I can ask concise questions,
> and so have just enjoyed the emails as a source of great interest. At some
> point I will post the various projects I am pursuing and the issues I face,
> but not today.
>
> This last post of yours, Tom, particularly caught my eye, as I have a very
> complicated project, and a spider that would hunt through various
> supermarkets' websites for me would be unbelievably helpful - I would
> certainly be very interested in any further information you have about this
> or how one would go about it.
>
> I am a newbie programmer and am teaching myself with a couple of friends as
> mentors, so this will be a very newbie question. What are the components
> necessary to create a spider program? Is it something that has to be made
> for each site individually? If the website in question updates, will this
> stop the spider from working? I have other questions, but those are the
> basic ones.
>
> Dan
>
> On Sun, Nov 16, 2008 at 9:17 AM, Tom Potts <tompotts@xxxxxxxxxxxxxxxxxxxx> wrote:
> > I've just been playing with Audiveris, which is a well cool (showing my
> > age here) Java app that takes a sheet music image and converts it to MIDI
> > or MusicXML, so someone like me who can't seem to learn to read sheet
> > music can play scores.
> > There are quite a few archives out there with out-of-copyright material
> > available, and I'd like to try converting a lot of it to MusicXML. I'd
> > like to automate the downloading of the images but get rid of the
> > detritus.
> > I want a trainable spider: show it the 'root' page of the collection,
> > click on a table or DDL and set that as the repeat action, then go down
> > another level to (say) composer level, make a local directory, then click
> > through to a song, make a local directory, drill down and get the
> > associated image(s), return to composer, get the next song, back to root,
> > get the next part of the collection...
> > It occurred to me something like this might also be useful for pulling
> > prices from supermarket websites for a comparison site, as they seem to
> > change their arrangements to try and make this difficult - 'Competition?
> > We love it, we just do everything we can to stop it...'
> >
> > Tom te tom te tom
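The scraping step described at the top of the thread can be sketched roughly as follows. This uses only the standard library's html.parser as a stand-in for BeautifulSoup (which makes the same job easier); the page markup, class names, routes and prices are all hypothetical.

```python
# Sketch of extracting prices from a results page. The HTML structure
# here is invented for illustration; a real site would differ.
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collects the text of every <td class="price"> cell."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_price = False

# In real use the HTML would come from urllib2.urlopen(url).read();
# a literal snippet keeps the sketch self-contained.
html = """
<table>
  <tr><td>Exeter-Bristol</td><td class="price">£18.50</td></tr>
  <tr><td>Bristol-London</td><td class="price">£32.00</td></tr>
</table>
"""
parser = PriceParser()
parser.feed(html)
print(parser.prices)  # → ['£18.50', '£32.00']
```

With BeautifulSoup the parser class collapses to a one-line `findAll` call, which is why it is the tool recommended above.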
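Tom's drill-down spider (root page, then composer level, then per-song images) can also be hand-coded rather than trained. This is a minimal offline sketch: the two-level site is stubbed with a dict of invented pages, where in real use each page would be fetched with urllib2 and the images saved into per-composer directories.

```python
# A minimal hand-rolled spider: follow every link on the root page,
# then collect the image URLs found one level down. All page names
# and contents are hypothetical.
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects every <a href> and <img src> on a page."""
    def __init__(self):
        super().__init__()
        self.links, self.images = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

# Hypothetical two-level collection: root -> composer pages -> score images.
SITE = {
    "/": '<a href="/bach">Bach</a><a href="/holst">Holst</a>',
    "/bach": '<img src="/bach/fugue-1.png">',
    "/holst": '<img src="/holst/jupiter-1.png"><img src="/holst/jupiter-2.png">',
}

def parse(page):
    p = LinkParser()
    p.feed(SITE[page])  # real code: urllib2.urlopen(base + page).read()
    return p

def crawl(root="/"):
    """Return {composer_page: [image urls]} for a two-level collection."""
    collected = {}
    for composer in parse(root).links:
        collected[composer] = parse(composer).images
    return collected

print(crawl())
```

This is also why a spider usually has to be written per site, and breaks when the site's markup changes: the parser above depends entirely on the tags the site happens to use.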
--
The Mailing List for the Devon & Cornwall LUG
http://mailman.dclug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/linux_adm/list-faq.html