On Friday 04 Jul 2003 4:26 pm, Jonathan Melhuish wrote:
> On Wednesday 02 July 2003 23:50, Neil Williams wrote:
> > Technically wrong? I'd say it was stretching the rules because:
> > 1. It uses a non-existent filesystem: It's pretending (if you read the
> > URL strictly) that there are 8 sub-directories below the .biz domain
> > whereas none probably exist with the names specified (with or without the
> > = ).
>
> Yeah, if you wish to interpret the "/" delimiter as 'directories' then it
> does; but I don't see that that should pose a problem. The URL is

I only highlighted it as a possible problem with external parsers - you
mentioned you thought the URL was causing problems for Google.

As far as scripts go, if you can isolate the search query from the domain
part of the URL then you could replace every / with, say, # before creating
the relative links and change them back later. Perl makes this easy, so
perhaps you need a sequence of scripts - one Perl, one sed, then a reverse
of the first Perl script. You could always wrap the scripts in a bash script
to leave one command. (Perl can do things like this on the command line
too.)

> > 2. It uses non-standard repetition: It's imitating a query string and
> > then adding a real one (the xx=xx would appear to some form of
> > variable=value statement) - repetition that is likely to cause many a
> > parser to barph.
>
> Not really. You interpreted the bit with slashes in above as a file
> location, and that seems like a fair enough conclusion. Are you telling me
> that "=" isn't a valid character for a filename? I had suspected that
> myself, but I can't find any evidence to support it.

Neither could I, so although I suspect it, I didn't want to state it. The
way I see it, [a-zA-Z0-9]+=[a-zA-Z0-9]+ is a regular expression that matches
the usual query string variable=value format, as you conclude.
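A rough sketch of the / to # swap mentioned above, assuming the search part is everything between the host name and the ?id= query string. The sub names protect_slashes and restore_slashes are made up for illustration (they are not part of Interchange or wget), and # is only a temporary placeholder, so this would mangle any URL that already carries a real # fragment:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: swap every / in the search part of the URL for #,
# leaving the scheme, the host and the trailing ?query untouched.
sub protect_slashes {
    my ($url) = @_;
    my ($base, $search, $query) = $url =~ m{^(\w+://[^/]+/)([^?]*)(\?.*)?$}
        or return $url;                  # not a URL we recognise: pass through
    $search =~ tr{/}{#};
    return $base . $search . ( defined $query ? $query : '' );
}

# Hypothetical helper: undo the swap. Assumes the URL had no # of its own.
sub restore_slashes {
    my ($url) = @_;
    $url =~ tr{#}{/};
    return $url;
}

# The URL from this thread, joined with . only to keep the lines short.
my $url = 'http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db'
        . '/co=yes/sf=category/se=OtherReceivers/va=banner_image='
        . '/va=banner_text=.html?id=f8YyQGtr';

my $safe = protect_slashes($url);
print "protected: $safe\n";
print "restored:  ", restore_slashes($safe), "\n";
```

With the slashes out of the way, anything that builds relative links sees a single path component after the host, and the second sub puts the original URL back afterwards.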
In my devil's advocate hat, I would expect problems when an external parser
(whether Google or a script) that knows nothing about your server
configuration comes across a URL that, for fair enough reasons, would appear
to include TWO query strings - or at best one badly formed query string
(missing the ?) and one correct one. I would expect many standards-compliant
programs to barph at the first and possibly miss the second.

> > 3. Required filedata is absent: There's no 'real' file anywhere for
> > processes like Google to grab onto - I'd presume there's some index.php
> > default.asp or similar behind it but it's not stated and therefore must
> > be assumed, which is often a bad tactic.
>
> No there isn't an "index.php", the page is dynamically generated by a Perl
> server engine and passed via the "sms.ic" linker program. No-one "assumes"
> its presence, nor can I see its relevance. You got to the URL, you get
> the page.

Are there any static pages? Index page(s)? Catalogues? Dynamic pages that
have a static URL?

URLs that look like a search result aren't going to attract attention from
Google users when displayed inside Google's own search results. I get the
feeling from engines like Google that URLs that look like search results
don't show up as favourably as URLs that look like a static page, that's
all. It's more useful to me (as a Google user) to find a static page
(perhaps a department catalogue etc.) in the search results at Google and
then proceed from there within your site. A URL that already contains a
search string may look, to Google or to Google users, like "someone else's
search" and perhaps that's why the pages aren't being indexed. A few static
URLs may well be all that Google needs.

> > Stretching the letter of the 'rules' but breaking the spirit? Personally,
> > I wouldn't like to use an engine that relied on this type of persistence.
> >
> > I'm not surprised that it doesn't parse well with processes like Google.
> >
>
> <groans> I knew I shouldn't have told you the URL ;-)

:-)

> quite a bit, isn't exactly great. I had mistakenly assumed that any code I
> used that a 'pro' had written would be clean and standards compliant :-(

Ouch. There really isn't much in a name, especially one as over-used and
over-played as Professional Edition or Pro Edition.

> > It would take some time to bring that page to the intended HTML4
> > Transitional standard proclaimed at the top of the page returned from
> > that
>
> You're damn right...
>
> The "?id=" bit does indeed store a unique customer number, the rest is
> stored in the database. The rest of the URL is just the search string. I
> don't see the problem with this approach.

Not the approach, just the way the URL uses / when other stores use 'formal'
query strings or session ID strings. You have already seen the problems -
creating relative links and Google not indexing the pages.

From the website design and programming viewpoint, using / just seems to be
asking for all sorts of horrible bugs and errors. It's one of those nasties
that just jumps out and shouts "I WILL BITE!" at me. I just know that as
soon as I try to do something non-standard with it, it'll be right there in
my face like a neon No Entry sign. I know it's easy in Perl to avoid / as
the delimiter in search patterns, but a URL full of / is likely to send
other scripting languages into chaos. I immediately dislike any programming
trait that locks me into one particular way of problem-solving - whether the
trait is present through design or negligence. Hence:

> > Is there a different engine available for the job?
>
> It's actually something I've been considering quite carefully, especially
> after having such serious performance problems. The Interchange user group
> generally maintain that the performance is "satisfactory", so long as your
> hardware is up to it.
> Which perhaps it is, but frankly new hardware is not
> an option at the moment, so I'm stuck with a 300MHz Celeron that's just
> recently been downgraded to 128MB.
>
> Mind you, I bet you Apache/MySQL could serve a few pages per second off
> even that lowly spec, so I don't see why there should be any excuse for
> such lame performance (<1 request/sec). And this is a Pro product!!

hehe. Sorry.

> OSCommerce in particular looks quite promising, I would be interested to
> hear if anybody has any experience with it. It will definitely be a
> serious contender if I develop another online store, but I'm not sure if I
> can justify the time and expense of completely ditching Interchange and the
> current SMS product database at this late stage. But it's certainly
> tempting...
>
> Jon

After hearing so many horror stories of e-commerce tools, I won't be looking
to gain any experience of them anytime soon!!

OK. Enough protests. Let's cut to the chase.

From your first email:

> Eg. you run a search and get sent to this location:
>
> http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf=category/se=
> OtherReceivers/va=banner_image=/va=banner_text=.html?id=f8YyQGtr
>
> Where there is a relative link to "./index.html", but of course that now
> translates to:
>
> http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf=category/se=
> OtherReceivers/va=banner_image=/index.html

I've forced a line break in each URL to make it readable. The relative link
you want is:

http://www.smssat.biz/index.html

Yes?

> The trouble is that it has to have the "sms.ic" bit when wget spiders it,
> so that it gets the live (dynamic) version, but NOT have the "sms.ic" in
> the mirrored (static) version.

By "sms.ic" do you mean the search string or the actual characters? The
actual characters are easy (probably what Kai was referring to when he
basically said RTFM):
    $ cat test.pl
    #!/usr/bin/perl
    use strict;
    my $url = "http://www.smssat.biz/sms.ic/index.html";
    print "old url: $url\n";
    $url =~ s/sms\.ic\///g;
    print "new Url: $url\n";

    $ perl test.pl
    old url: http://www.smssat.biz/sms.ic/index.html
    new Url: http://www.smssat.biz/index.html

To solve the first problem - getting to http://www.smssat.biz/index.html
from the relative link ./index.html :

    #!/usr/bin/perl
    use strict;
    ################# Variable List ###############
    my $url;      # the search results + query string
    my $match;    # the search results split off from the domain
    my @matches;  # array of each search result element
    my $content;  # holds each member of @matches in turn.
    my $c;        # counter
    ############### End variable list ##############
    $url = "http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf=category/se=
    OtherReceivers/va=banner_image=/va=banner_text=.html?id=f8YyQGtr";
    print "old url: $url\n";
    $url =~ s$http://www\.smssat\.biz/(.*)$http://www.smssat.biz/$g;
    print "new Url: $url\n";
    $match = $1;
    print "match $match\n";
    $c=0;
    @matches = split /\//,$match;
    foreach $content (@matches) {
        $c++;
        print "Content $c: $content\n";
    }

Again, a line ending has been forced that isn't in the script. (Same place
in each case.)

Note the use of $ as the delimiter for the substitution - it saves escaping
all the /. That's what I meant by lock-in - it's something that Perl can do
easily but which would cause problems in other scripting languages. The
split function needs to match the / itself, so the / in its pattern is
escaped: \/. The . characters in the substitution pattern are also escaped,
to stop Perl treating . as a wildcard that matches any character.
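As a side note on delimiters, here is a small sketch of the same split written with brace delimiters, which some people find easier to read than $ and which need no escaped / at all. The sub name search_segments and the shortened sample URL are invented for this illustration; they are not taken from the script above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the /-separated pieces that follow the host.
# m{...} lets the pattern contain / without any backslashes.
sub search_segments {
    my ($url) = @_;
    # The dots in the host name are escaped so they match a literal "."
    # rather than acting as the "any character" wildcard.
    my ($rest) = $url =~ m{^http://www\.smssat\.biz/(.*)$}
        or return;                       # some other site: nothing to split
    return split m{/}, $rest;
}

# Shortened sample URL, for illustration only.
my @parts = search_segments(
    'http://www.smssat.biz/scan/fi=products/st=db.html?id=f8YyQGtr');

my $c = 0;
foreach my $content (@parts) {
    $c++;
    print "Content $c: $content\n";
}
```

The behaviour is meant to be the same as the split in the script above; only the quoting changes.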
Output:

    $ perl test.pl
    old url: http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf=category/se=
    OtherReceivers/va=banner_image=/va=banner_text=.html?id=f8YyQGtr
    new Url: http://www.smssat.biz/
    match scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf=category/se=
    OtherReceivers/va=banner_image=/va=banner_text=.html?id=f8YyQGtr
    Content 1: scan
    Content 2: fi=products
    Content 3: sp=results_big_thumb
    Content 4: st=db
    Content 5: co=yes
    Content 6: sf=category
    Content 7: se=OtherReceivers
    Content 8: va=banner_image=
    Content 9: va=banner_text=.html?id=f8YyQGtr

Is that in the right direction?

I'm sure you can process each content value as appropriate from here and
create the new static URL by a similar process. You'd then just call the
Perl script as part of the copy process - in a pipe. The output of the live
site would be piped into the Perl script, which would transform it and pass
suitable static links to whatever process you want to use to write the
output to files. (Perl could do the whole thing for you.)

-- 
Neil Williams
=============
http://www.codehelp.co.uk
http://www.dclug.org.uk
http://www.wewantbroadband.co.uk/