Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to parse if css selector is non-unique #4

Open
skanga opened this issue Jun 14, 2016 · 7 comments
Open

How to parse if css selector is non-unique #4

skanga opened this issue Jun 14, 2016 · 7 comments

Comments

@skanga
Copy link

skanga commented Jun 14, 2016

Consider this HTML snippet

<p class="attrgroup"><span><b>2001 KAWASAKI DRIFTER</b></span></p>
<p class="attrgroup">
<span>condition: <b>excellent</b></span>
<br><span>engine displacement (CC): <b>800</b></span>
<br><span>fuel: <b>gas</b></span>
<br><span>odometer: <b>17500</b></span>
<br><span>paint color: <b>red</b></span>
<br><span>title status: <b>clean</b></span>
<br><span>transmission: <b>manual</b></span><br>
</p>

In this case how do you extract "columns" for fuel, odometer, transmission, etc especially in the case when a) the order may be different and b) some fields may be missing.

Note: this snippet is taken from https://sfbay.craigslist.org/nby/mcy/5623911440.html

@breeve1
Copy link
Contributor

breeve1 commented Jun 14, 2016

The below works. If the nth-child is different for some reason, I would do multiple selects into a table buffer. One might be empty but the other won't. Case statements are also supported in the select clause.

select
    pick ''
from download page 'https://sfbay.craigslist.org/nby/mcy/5623911440.html'
where nodes = '.mapAndAttrs .attrgroup:nth-child(3) span'

@breeve1
Copy link
Contributor

breeve1 commented Jun 14, 2016

The below might get you started. I wrote this code to go through the pages and get all the details for a motorcycle.

create buffer totalCount(count int)

insert into totalCount
select 
    pick '.totalcount'
from download page 'https://sfbay.craigslist.org/search/nby/mcy'

create buffer pageUrls(url string)

insert into pageUrls
select 'https://sfbay.craigslist.org/search/nby/mcy'

each(var c in totalCount) {

var pagesCounts = c.count / 100

insert into pageUrls
select 'https://sfbay.craigslist.org/search/nby/mcy?s=' + value
from expand (1 to pagesCounts){$ * 100}
}

create buffer detailsUrls(url string)

insert into detailsUrls
select
    'https://sfbay.craigslist.org' + pick 'a.hdrlnk' take attribute 'href'
from download page (select url from pageUrls) with (thread(5))
where nodes = '.content .rows p'

var detailDownloads = download page (select url from detailsUrls) with (thread(10)) --download all pages

create buffer motorCycle(url string, title string)
insert into motorCycle
select
    url, 
    pick '#titletextonly'
from detailDownloads

create buffer motorCycleDetails(url string, metric string) 

insert into motorCycleDetails
select
    url,
    pick ''
from detailDownloads
where nodes = '.mapAndAttrs .attrgroup:nth-child(3) span'

insert into motorCycleDetails
select
    url,
    pick ''
from detailDownloads
where nodes = '.mapAndAttrs .attrgroup:nth-child(2) span'

select m.url, title, metric
from motorCycle m
join motorCycleDetails d on d.url = m.url

@VitoVan
Copy link

VitoVan commented Jun 14, 2016

Does this nth-child way solve the

a) the order may be different and
b) some fields may be missing.

problems?

Seems not so convincible.

@breeve1
Copy link
Contributor

breeve1 commented Jun 14, 2016

So in this case I believe it does. Since we are selecting all the spans and if some metrics are missing they won't be in the spans. We are also selecting the full text so each row would be:

engine displacement (CC): 800
fuel: gas
odometer: 17500
paint color: red
title status: clean

So it is category:value pairs

@VitoVan
Copy link

VitoVan commented Jun 14, 2016

Thank you for the explanation, I just found a windows machine and executed your script,
result
It works just right, but... Can we get something like this:

url engine fuel odometer paint color
http://www.example.com/moto?id=1 800 gas 17500 red
http://www.example.com/moto?id=2 800 gas NULL blue
http://www.example.com/moto?id=3 800 gas 20500 NULL

@breeve1
Copy link
Contributor

breeve1 commented Jun 14, 2016

You can run on the mac with mono. You have to download the command line.

What about the below at the end of the script. I don't have group by implemented but once in SQL you can do a group by and do a min or max on all the columns to flatten them out. I have regular expressions implemented but right now they only work on pick statements but that would be easy to add so it could cleanup the text inside the case statements to eliminated the value pair once that is done.

create buffer final(url string, motorCycle string, condition string, engine string, fuel string, odometer string, paint string, title string, transission string)

insert into final
select
    m.url,
    title,
    case when metric like '%condi' then metric else '' end,
    case when metric like '%engine' then metric else '' end,
    case when metric like '%fuel' then metric else '' end,
    case when metric like '%odomet' then metric else '' end,
    case when metric like '%paint' then metric else '' end,
    case when metric like '%title' then metric else '' end,
    case when metric like '%trans' then metric else '' end
from motorCycle m
join motorCycleDetails d on d.url = m.url

select *
from final

@VitoVan
Copy link

VitoVan commented Jun 15, 2016

@breeve1 This will work, thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants