Unix
3.
There is a magical kind of regexp you can use to remove runs of things. Try s/<[^>]*>//g
This says to remove every tag (literally, a < followed by any number of characters that are not >, followed by a >) on each line.
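Here is a minimal sketch of that tag-stripping expression run from a shell instead of the sed page. The HTML line and the variable names (html_line, stripped) are made up for illustration, and it assumes a standard sed is installed.

```shell
# Made-up HTML line, used only for illustration.
html_line='<li><a href="page.html">Subaru Outback</a> <b>2009</b></li>'

# [^>]* matches a run of characters that are not ">", so each match
# stops at the first ">" it meets and covers exactly one tag.
stripped=$(printf '%s\n' "$html_line" | sed 's/<[^>]*>//g')

printf '%s\n' "$stripped"
```

Only the text between the tags is left. The g flag at the end is what makes sed remove every tag on the line rather than just the first one.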
YOUR TURN: Try s/<.*>//g and explain why this removes much more than the intended HTML tags.
YOUR TURN: You could actually make HTML more readable by adding an extra ” ___ ” before each “>”. Try this and show a few lines. Is it easier on the eyes?
4.
Go back to your Subaru data. There is a way to reuse what you matched by putting a “&” in the replacement string. This is loosely called a “backreference” (the & stands for the whole match) and it works like this:
Try the sed command s/.*/&\n\n/
This says to match all characters on each line, and replace them with whatever matched, followed by two newlines. Did it work?
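To see exactly what & carries into the replacement, here is a small sketch that wraps the match in brackets instead of adding newlines. The brackets keep the result on one line and sidestep differences in how sed versions treat \n in a replacement. The listing line and the variable names are invented for illustration and are not taken from the real Subaru data.

```shell
# Made-up listing line, used only for illustration.
listing='2009 Subaru Outback $8500 - 92000 mi'

# & in the replacement stands for the entire text the pattern matched.
# Wrapping it in brackets makes the matched span visible.
bracketed=$(printf '%s\n' "$listing" | sed 's/[0-9]* mi/[&]/')

printf '%s\n' "$bracketed"
```

Whatever ends up inside the brackets is what & expanded to. Swapping in a different pattern is a quick way to check what an expression really matches.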
Try this: s/.*mi/\t\t\t\t&\n/
YOUR TURN: Why did the last expression work? Can you make it so that only the mileage on the car is tabbed over and followed by an extra empty line? The expression I gave is matching both the mileage on the car and the distance of the seller from Springfield, IL.
YOUR TURN: Can you write an s-expression that moves over the miles AND the $ value? It would look something like s/$[0-9]|[0-9] mi/\t\t\t\t&/ but you’ll have to improve the matching part.
5.
Go to this page and grab all the data, ctrl-a, ctrl-c, and put it in the top box on your sed page: http://www.baseball-reference.com/leaders/WAR_top_ten.shtml
Now, we want to erase Barry Bonds from the record book, but not Bobby Bonds. This is not so easy, but we can do it. (Actually, I am not that angry at Barry Bonds; I’ve got much more Rafael Palmeiro disgust.)
First, we can just say s/Barry Bonds//
and we will remove him from each of the two lines that show Barry Bonds’ career occurrences in the top ten each year in this valuable statistic (WAR).
YOUR TURN: Actually, can you remove Barry Bonds AND the associated stat, leaving just two empty lines? What s-expression did you use?
Now, we can remove Bonds from each of the years by noticing that the 80s, 90s, and 2000s belonged to Barry, and the 70s belonged to Bobby.
Try this: /^19[89][0-9]/ s/Bonds/XXXX/
This says to apply the s-expression which replaces Bonds with XXXX ONLY on those lines that start with a 1980s or 1990s year.
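The address-then-command form can be sketched on a few made-up lines. The layout and the 0.0 statistics below are placeholders rather than data from the page, and “Some Player” is a stand-in name.

```shell
# Made-up sample lines; the 0.0 stats are placeholders, not real data.
sample='1973 Bobby Bonds (0.0)
1985 Some Player (0.0)
1993 Barry Bonds (0.0)'

# The address /^19[89][0-9]/ selects lines that start with 198x or 199x;
# the s command after it runs only on those selected lines.
redacted=$(printf '%s\n' "$sample" | sed '/^19[89][0-9]/ s/Bonds/XXXX/')

printf '%s\n' "$redacted"
```

The 1973 line passes through untouched because it fails the address test. The 1985 line passes the address test but contains no “Bonds”, so the substitution has nothing to change.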
YOUR TURN: Can you write an expression like the last one that catches all the 1980s, 1990s, AND 2000s?
YOUR TURN: Notice that we are redacting his name, but not removing the stat. Can you change the s-expression so it applies only to Barry’s years, AND it completely removes his name and his parenthesized statistic?
6.
ADVANCED CHALLENGE
First, notice that you can chain your match-and-substitute (s/pattern/replacement/) expressions with a ;
So you could say
/Barry Bonds/ s/.*// ; /^200[0-9]/ s/Bonds.....//g
and do both s-expressions in a single pass.
Except that the latter regexp match is not quite right.
YOUR TURN: Repair it.
YOUR TURN: Add an s-expression to change those annoying %09 sequences (URL-encoded tabs) into something like a _.
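The ; chaining used in this challenge can be sketched with two unrelated substitutions on made-up lines. This is an illustration only, not a solution to the exercises above. The sample lines, the 0.0 placeholder stats, and the “(stat)” marker are all invented.

```shell
# Made-up sample lines; the 0.0 stats are placeholders, not real data.
sample='1973 Bobby Bonds (0.0)
1993 Barry Bonds (0.0)'

# Commands separated by ";" run in order on every input line:
#   1. on lines starting with 199x, replace Bonds with XXXX
#   2. on every line, replace the placeholder stat with the word "stat"
chained=$(printf '%s\n' "$sample" | sed '/^199[0-9]/ s/Bonds/XXXX/; s/(0\.0)/(stat)/')

printf '%s\n' "$chained"
```

The first command has an address, so it touches only the 1993 line. The second has none, so it applies to both lines. Each line passes through the whole chain before sed moves on to the next line.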
Some readings and videos will be posted soon for sed…