More about parsing
In the previous chapter we only glanced at the parse_doc method. Here we take a deeper look at the Scanner object that is passed to parse_doc.
The Scanner is similar to Ruby's built-in StringScanner: it remembers the position of a scan pointer (a position inside the string we're parsing). Scanning is the process of advancing the scan pointer through the string a small step at a time. There are two core methods for this:
- match(regex) matches a regex starting at the current position of the scan pointer, advances the scan pointer to the end of the match, and returns the matching string. When the regex doesn't match, it returns nil.
- look(regex) does the same, except that it doesn't advance the scan pointer, so its use is to look ahead.
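Since the Scanner is described as similar to StringScanner, the difference between the two methods can be tried out with StringScanner itself. This is only a sketch under the assumption that match behaves like StringScanner#scan and look like StringScanner#check; the input string is made up for illustration.

```ruby
require 'strscan'

# StringScanner stands in for the Scanner here (assumption):
# scan plays the role of match, check plays the role of look.
s = StringScanner.new("<hello> world")

looked = s.check(/</)      # like look: returns "<", pointer stays put
pos_after_look = s.pos

matched = s.scan(/</)      # like match: returns "<", pointer moves past it
pos_after_match = s.pos

missed = s.scan(/\d+/)     # no digits at the pointer: returns nil
pos_after_miss = s.pos     # a failed match leaves the pointer where it was

puts "look:  #{looked.inspect} at #{pos_after_look}"
puts "match: #{matched.inspect} at #{pos_after_match}"
puts "miss:  #{missed.inspect} at #{pos_after_miss}"
```

Note that both calls only match at the scan pointer itself; they do not search forward through the rest of the string.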
Let's visualize how scanning works with an example of parsing an @author tag that can take either just a name, or an e-mail address plus a name:
* @author <[email protected]> John Doe
* @author Code Monkey
Here's a parse_doc method for parsing this tag:
def parse_doc(scanner, position)
  if scanner.look(/</)
    scanner.match(/</)
    email = scanner.match(/\w+@\w+(\.\w+)+/)
    scanner.match(/>/)
    scanner.hw
  end
  name = scanner.match(/.*$/)
  return { :tagname => :author, :name => name, :email => email }
end
Let's step through it while it's parsing the first line of our example code.
Here's the state of the Scanner at the time parse_doc gets called:
# @author |<[email protected]> John Doe
The scan pointer (denoted as |) has stopped at the first non-whitespace character after the name of the tag. At this point we can look ahead to see what's coming. For example, we can check whether we're at the beginning of an e-mail address block:
if scanner.look(/</) # @author |<[email protected]> John Doe
If so, we want to extract the e-mail address. But first let's match the < character, which we want to exclude from the e-mail address:
scanner.match(/</) # @author <|[email protected]> John Doe
The scan pointer has moved forward a step, so we can now match the e-mail address itself and store it in a variable:
email = scanner.match(/\w+@\w+(\.\w+)+/) # @author <[email protected]|> John Doe
Then we skip the closing >:
scanner.match(/>/) # @author <[email protected]>| John Doe
Let's also skip the whitespace that follows, using the hw method of Scanner, which skips just the horizontal whitespace:
scanner.hw # @author <[email protected]> |John Doe
From here on we only want to match the name of the author, which could be anything, so we match up to the end of the line:
name = scanner.match(/.*$/) # @author <[email protected]> John Doe|
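Both of the last two steps stay on the current line, which matters when more tag text follows on the next line. The sketch below illustrates this with StringScanner, assuming that "horizontal whitespace" means spaces and tabs (so hw is approximated here by the pattern [ \t]*); the input strings are made up.

```ruby
require 'strscan'

# Tag text as it might look just after the closing ">", with another line following.
input = " \tJohn Doe\n * @author Code Monkey"

s = StringScanner.new(input)
s.scan(/[ \t]*/)           # assumed equivalent of hw: spaces and tabs only
name = s.scan(/.*$/)       # "." never matches "\n", and "$" matches right before it
at_newline = s.peek(1)     # the pointer is parked in front of the newline

# For contrast: \s also matches "\n", so it can run on into the next line.
t = StringScanner.new(" \t\n * @author Code Monkey")
t.scan(/\s*/)
crossed_into = t.peek(1)

puts name.inspect, at_newline.inspect, crossed_into.inspect
```

In Ruby, $ anchors at the end of a line rather than the end of the whole string, which is why /.*$/ reads exactly one line's worth of text.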
Finally we return all the extracted values:
return { :tagname => :author, :name => "John Doe", :email => "[email protected]" }
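To try the whole walkthrough outside of a real tag class, the parse_doc method can be run against a small stand-in scanner. Everything except parse_doc itself is an assumption made for illustration: MiniScanner is a hypothetical class built on StringScanner (not the actual Scanner), hw is assumed to skip spaces and tabs, john@example.com is a made-up address, and nil is passed for position because the method doesn't use it.

```ruby
require 'strscan'

# Hypothetical stand-in for the Scanner, built on Ruby's StringScanner.
class MiniScanner
  def initialize(input)
    @ss = StringScanner.new(input)
  end

  # Match at the scan pointer and advance past the match; nil on failure.
  def match(regex)
    @ss.scan(regex)
  end

  # Match at the scan pointer without advancing it.
  def look(regex)
    @ss.check(regex)
  end

  # Skip horizontal whitespace (assumed here: spaces and tabs).
  def hw
    @ss.scan(/[ \t]*/)
  end
end

# The parse_doc method from this chapter, unchanged.
def parse_doc(scanner, position)
  if scanner.look(/</)
    scanner.match(/</)
    email = scanner.match(/\w+@\w+(\.\w+)+/)
    scanner.match(/>/)
    scanner.hw
  end
  name = scanner.match(/.*$/)
  return { :tagname => :author, :name => name, :email => email }
end

# The scanner starts where the walkthrough does: right after the tag name.
with_email = parse_doc(MiniScanner.new("<john@example.com> John Doe"), nil)
name_only  = parse_doc(MiniScanner.new("Code Monkey"), nil)

p with_email
p name_only
```

For the second input the if branch is never entered, so email is never assigned. Ruby still treats it as a declared local variable with the value nil, which is why the method can return it without raising an error.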