Improved Bambenek parser to better handle description#1451
Conversation
amojamo
commented
Sep 13, 2019
- Changed value to values as it makes more sense.
- Use of regex to read the description for better robustness. As it stands now, there is a conflict when the Bambenek parser reads the IP list. This is because there is a slight change in the Bambenek IP list, where they have a longer description with more commas than usual.
|
|
||
| else: | ||
| value = line.split(',') | ||
| m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \ |
There was a problem hiding this comment.
Could we change .* (greedy) to .*? (non-greedy) here?
There was a problem hiding this comment.
Ah, I'm sorry, I didn't consider IPv6. Regex for an IPv6 pattern is out of my scope.
The greedy to non-greedy works for the description group, but not for the URL group.
| value = line.split(',') | ||
| m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \ | ||
| (?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)", line) | ||
| values = m.groups() |
There was a problem hiding this comment.
That line raises an exception in the tests
There was a problem hiding this comment.
It could be my line split. I'm not sure how to break the regex line into two lines in order not to trigger the "Line too long" warning when it comes to code style checking.
There was a problem hiding this comment.
Just split the line like this:
m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), "
r"(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)", line)| else: | ||
| value = line.split(',') | ||
| m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \ | ||
| (?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)", line) |
There was a problem hiding this comment.
Could we change .* (greedy) to .*? (non-greedy) here?
Can we $ at the end to indicate the regular expression should match the full line?
They (Bambenek) have since last time edited the IP list, so it parses the line without error as of today. The problem before their last edit was on the following line(s):
This has since been changed to the following:
The problem was that the parsing failed because there were more commas than anticipated, so This kinda makes the change obsolete in a way, but without a regex expression the parser is more fragile than it needs to be. |
Well, that's obvious.
If the regular expression itself is stable - yes. I opened https://github.com/amojamo/intelmq/pull/1 for some tests which you could use for development. In case the format is proper CSV (using commas is ok then if properly escaped), we can use the csv parser of python itself like with any csv-based feed. That's IMHO the best option. |
TST: add bambenek tests for commas in descriptions
|
Are you still working on this? The tests are still failing on |