More about parsing

In the previous chapter we only glanced at the parse_doc method. Here we take a closer look at the Scanner object that is passed to it.

The Scanner is similar to Ruby's built-in StringScanner: it remembers the position of a scan pointer (a position inside the string we're parsing). Scanning is the process of advancing the scan pointer through the string one small step at a time. There are two core methods for this:

match(regex) matches a regex starting at the current position of the scan pointer, advances the scan pointer to the end of the match, and returns the matched string. When the regex doesn't match, it returns nil and leaves the scan pointer where it was.
look(regex) does the same, except that it never advances the scan pointer, so its use is to look ahead.
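
Since the Scanner is described as similar to StringScanner, the difference between the two methods can be tried out with StringScanner itself. This is only a sketch: it assumes that StringScanner#scan behaves like match and StringScanner#check behaves like look, and the input string is made up for illustration.

```ruby
require 'strscan'

# StringScanner stands in for the Scanner here (an assumption for illustration).
ss = StringScanner.new("<john> Doe")

peeked = ss.check(/</)     # like look: returns the match, pointer stays put
pos_after_check = ss.pos

taken = ss.scan(/</)       # like match: returns the match and advances
pos_after_scan = ss.pos

missed = ss.scan(/\d+/)    # no match at the pointer: returns nil, pointer unchanged
pos_after_miss = ss.pos
```

Both calls return the same string; only the position of the scan pointer afterwards differs.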

Let's visualize how scanning works with an example of parsing an @author tag that takes either an e-mail address followed by a name, or just a name:

* @author <[email protected]> John Doe
* @author Code Monkey

Here's a parse_doc method for parsing this tag:

def parse_doc(scanner, position)
  if scanner.look(/</)
    scanner.match(/</)
    email = scanner.match(/\w+@\w+(\.\w+)+/)
    scanner.match(/>/)
    scanner.hw
  end
  name = scanner.match(/.*$/)

  return { :tagname => :author, :name => name, :email => email }
end

Let's step through it while it's parsing the first line of our example code.

Here's the state of the Scanner at the time parse_doc gets called.

                                            # @author |<[email protected]> John Doe

The scan pointer (denoted as |) has stopped at the first non-whitespace character after the name of the tag. At that point we could look ahead to see what's coming. Say, we could check if we're at the beginning of an e-mail address block:

if scanner.look(/</)                      # @author |<[email protected]> John Doe

If so, we want to extract the e-mail address. But first let's match the < character, which we want to exclude from the e-mail address:

scanner.match(/</)                        # @author <|[email protected]> John Doe

The scan pointer has now moved one step forward, so we can match the e-mail address itself and store it in a variable:

email = scanner.match(/\w+@\w+(\.\w+)+/)  # @author <[email protected]|> John Doe

Then we skip the closing >:

scanner.match(/>/)                        # @author <[email protected]>| John Doe

And let's also skip the whitespace that follows, using the hw method of Scanner, which skips just the horizontal whitespace:

scanner.hw                                # @author <[email protected]> |John Doe

From here on we just want to match the name of the author, which could be anything, so we just match up to the end of a line:

name = scanner.match(/.*$/)               # @author <[email protected]> John Doe|

Finally we return all the extracted values:

  return { :tagname => :author, :name => "John Doe", :email => "[email protected]" }
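
To run the walkthrough outside of the real tool, the Scanner can be approximated with a small wrapper around StringScanner. This is a minimal sketch, not the actual Scanner implementation: the MiniScanner class is hypothetical, hw is assumed to skip spaces and tabs only, the position argument is unused and passed as nil, and john@example.com is a made-up address. The input strings start right after the tag name, where the scan pointer sits when parse_doc is called.

```ruby
require 'strscan'

# Hypothetical stand-in for the Scanner, built on StringScanner.
class MiniScanner
  def initialize(input)
    @ss = StringScanner.new(input)
  end

  # Match at the scan pointer and advance past the match (nil on failure).
  def match(regex)
    @ss.scan(regex)
  end

  # Match at the scan pointer without advancing it.
  def look(regex)
    @ss.check(regex)
  end

  # Skip horizontal whitespace (assumed here to mean spaces and tabs).
  def hw
    @ss.scan(/[ \t]*/)
  end
end

# The parse_doc method from above, unchanged.
def parse_doc(scanner, position)
  if scanner.look(/</)
    scanner.match(/</)
    email = scanner.match(/\w+@\w+(\.\w+)+/)
    scanner.match(/>/)
    scanner.hw
  end
  name = scanner.match(/.*$/)

  return { :tagname => :author, :name => name, :email => email }
end

# Made-up inputs mirroring the two forms of the @author tag.
with_email = parse_doc(MiniScanner.new("<john@example.com> John Doe"), nil)
name_only  = parse_doc(MiniScanner.new("Code Monkey"), nil)
```

In the name-only case the if branch is skipped, so email is never assigned a value and comes back as nil.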
