How to parse if css selector is non-unique #4
Comments
The query below works. If the nth-child index differs on some pages, I would run multiple selects into a table buffer: one might come back empty, but the other won't. Case statements are also supported in the select clause.
select
pick ''
from download page 'https://sfbay.craigslist.org/nby/mcy/5623911440.html'
where nodes = '.mapAndAttrs .attrgroup:nth-child(3) span'

The script below might get you started. I wrote it to go through the search result pages and collect all the details for each motorcycle listing.
create buffer totalCount(count int)
insert into totalCount
select
pick '.totalcount'
from download page 'https://sfbay.craigslist.org/search/nby/mcy'
create buffer pageUrls(url string)
insert into pageUrls
select 'https://sfbay.craigslist.org/search/nby/mcy'
each(var c in totalCount) {
var pagesCounts = c.count / 100
insert into pageUrls
select 'https://sfbay.craigslist.org/search/nby/mcy?s=' + value
from expand (1 to pagesCounts){$ * 100}
}
create buffer detailsUrls(url string)
insert into detailsUrls
select
'https://sfbay.craigslist.org' + pick 'a.hdrlnk' take attribute 'href'
from download page (select url from pageUrls) with (thread(5))
where nodes = '.content .rows p'
var detailDownloads = download page (select url from detailsUrls) with (thread(10)) --download all pages
create buffer motorCycle(url string, title string)
insert into motorCycle
select
url,
pick '#titletextonly'
from detailDownloads
create buffer motorCycleDetails(url string, metric string)
insert into motorCycleDetails
select
url,
pick ''
from detailDownloads
where nodes = '.mapAndAttrs .attrgroup:nth-child(3) span'
insert into motorCycleDetails
select
url,
pick ''
from detailDownloads
where nodes = '.mapAndAttrs .attrgroup:nth-child(2) span'
select m.url, title, metric
from motorCycle m
join motorCycleDetails d on d.url = m.url

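The paging step in the script (total count divided by 100, then one search URL per offset) can be sketched in Python. This only illustrates the arithmetic and is not the tool's own code: the function name `page_urls` and the `page_size` parameter are hypothetical, and integer division for the page count is an assumption the script does not spell out.

```python
def page_urls(base_url: str, total_count: int, page_size: int = 100) -> list[str]:
    """Return the first search page plus one URL per further offset."""
    urls = [base_url]  # the first page carries no offset parameter
    # Assumption: the page count is truncated to an integer.
    pages = total_count // page_size
    for i in range(1, pages + 1):
        # Mirrors "?s=" + (i * 100) from the script.
        urls.append(f"{base_url}?s={i * page_size}")
    return urls


urls = page_urls("https://sfbay.craigslist.org/search/nby/mcy", 250)
```

When the total is an exact multiple of the page size, the last generated offset points at an empty page, which costs one extra download but no extra rows.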
Does this solve the problems above? It does not seem convincing to me.

In this case I believe it does. We are selecting all the spans, so if some metrics are missing they simply won't be among the spans. We are also selecting the full text, so each row would look like: engine displacement (CC): 800. Each row is therefore a category:value pair.

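Because every span arrives as a category:value string, field order and missing fields stop mattering once each string is split at its first colon and keyed by category. A minimal Python sketch of that idea follows; `split_metric` is a hypothetical helper, and the fuel and odometer rows are made-up sample values (only the engine displacement row comes from the thread).

```python
def split_metric(metric: str) -> tuple[str, str]:
    """Split 'category: value' at the first colon; value is '' if there is no colon."""
    category, _, value = metric.partition(":")
    return category.strip(), value.strip()


# Sample rows: the first is from the thread, the others are invented for illustration.
rows = ["engine displacement (CC): 800", "fuel: gas", "odometer: 12000"]

# Keying by category makes lookup independent of the order of the spans.
attrs = dict(split_metric(r) for r in rows)

# A field that was not on the page is simply absent from the dict.
transmission = attrs.get("transmission", "")
```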
Thank you for the explanation. I just found a Windows machine and executed your script.

You can run it on a Mac with Mono; you have to download the command-line version. What about adding the below at the end of the script? I don't have group by implemented, but once the data is in SQL you can do a group by and take a min or max on all the columns to flatten them out. I have regular expressions implemented, but right now they only work on pick statements. That would be easy to extend, and once it is done a regex could clean up the text inside the case statements to strip the category label from each value pair.
create buffer final(url string, motorCycle string, condition string, engine string, fuel string, odometer string, paint string, title string, transmission string)
insert into final
select
m.url,
title,
case when metric like '%condi' then metric else '' end,
case when metric like '%engine' then metric else '' end,
case when metric like '%fuel' then metric else '' end,
case when metric like '%odomet' then metric else '' end,
case when metric like '%paint' then metric else '' end,
case when metric like '%title' then metric else '' end,
case when metric like '%trans' then metric else '' end
from motorCycle m
join motorCycleDetails d on d.url = m.url
select *
from final

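The flattening step mentioned above (a group by with max on every column, done once the rows are in SQL) could look roughly like this. It is a sketch using Python's built-in `sqlite3` module rather than the scraping tool, with a cut-down `final` table and invented sample rows; the column subset and values are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table final (url text, title text, condition text, fuel text, odometer text)"
)

# One row per (listing, metric): only the matching column is filled, the rest are ''.
conn.executemany(
    "insert into final values (?, ?, ?, ?, ?)",
    [
        ("u1", "Bike A", "condition: good", "", ""),
        ("u1", "Bike A", "", "fuel: gas", ""),
        ("u1", "Bike A", "", "", "odometer: 12000"),
        ("u2", "Bike B", "", "fuel: gas", ""),  # listing with missing fields
    ],
)

# max() picks the non-empty string in each group, collapsing to one row per listing.
flat = conn.execute(
    "select url, title, max(condition), max(fuel), max(odometer) "
    "from final group by url, title order by url"
).fetchall()
```

With empty strings as the filler, max is the aggregate to use, since an empty string sorts before any non-empty text in SQLite. If the filler were NULL instead, either min or max would work, because both ignore NULL values.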
@breeve1 This will work, thank you.

Consider this HTML snippet
In this case, how do you extract "columns" for fuel, odometer, transmission, etc., especially when a) the order may be different and b) some fields may be missing?
Note: this snippet is taken from https://sfbay.craigslist.org/nby/mcy/5623911440.html