Discussion:
extract repeated text from string
Punit Jain
2018-07-18 14:07:51 UTC
Hi,

I am working on an issue where I need to extract repeated text from a
string:

The string is abcdefzfabcdefzfabcdefzf

I tried using a lookahead, /(?=(a.*f))/, but this extracts the groups as:

abcdefzfabcdefzfabcdefzf
abcdefzfabcdefzf
abcdefzf

However, I am looking for this output:
abcdefzf
abcdefzf
abcdefzf

Any clues?

Regards
Punit
Hassan Schroeder
2018-07-18 15:18:12 UTC
Can you explain what the logic of the pattern is? This "works" for
your exact example:

2.5.1 (main):0 > sample
=> "abcdefzfabcdefzfabcdefzf"
2.5.1 (main):0 > sample.scan /(?=(a.*?f.*?f))/
=> [
[0] [
[0] "abcdefzf"
],
[1] [
[0] "abcdefzf"
],
[2] [
[0] "abcdefzf"
]
]
2.5.1 (main):0 >

but might not be universally applicable...
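
A note on why the lazy quantifiers make a difference: in /(?=(a.*f))/ the .* is greedy, so from each "a" the capture runs to the last "f" in the string, which gives the three nested results from the original post. With .*? each capture stops at the nearest "f" that lets the pattern match. The sketch below only illustrates this on the sample string; it assumes every repeated unit starts with "a" and contains exactly two "f" characters, and the constant names are made up for the example.

```ruby
LOOKAHEAD_SAMPLE = "abcdefzfabcdefzfabcdefzf"

# Greedy: each capture extends from an "a" to the last "f" in the string.
GREEDY_UNITS = LOOKAHEAD_SAMPLE.scan(/(?=(a.*f))/).flatten

# Lazy: each capture stops at the second "f" after the "a".
LAZY_UNITS = LOOKAHEAD_SAMPLE.scan(/(?=(a.*?f.*?f))/).flatten

# The units do not overlap, so the lookahead is optional for this input.
PLAIN_UNITS = LOOKAHEAD_SAMPLE.scan(/a.*?f.*?f/)

puts GREEDY_UNITS.inspect
puts LAZY_UNITS.inspect
puts PLAIN_UNITS.inspect
```

Without the lookahead, scan returns plain strings instead of one-element arrays, which may be easier to work with.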
--
Hassan Schroeder ------------------------ ***@gmail.com
twitter: @hassan
Consulting Availability : Silicon Valley or remote

Unsubscribe: <mailto:ruby-talk-***@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>
Robert Klemme
2018-07-18 15:28:48 UTC
Post by Hassan Schroeder
Can you explain what the logic of the pattern is?
This! The original question sounds a bit like Punit was looking for a
mechanism to identify repeated text in the input. As long as no
pattern for that text is given, regex is not the right tool for the
job.

If you know the first character of the repeated part (or the repeated
string always starts at a specific position) then you can cook
something:

irb(main):017:0> s
=> "abcabcabcdab"
irb(main):018:0> s.scan /((.+)\2)/
=> [["abcabc", "abc"]]
irb(main):019:0> s.scan /((.+)\2+)/
=> [["abcabcabc", "abc"]]

irb(main):020:0> s="abcdeabcdabcd"
=> "abcdeabcdabcd"
irb(main):021:0> s.scan /((.+)\2+)/
=> [["abcdabcd", "abcd"]]
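
Applied to the string from the original post, the same backreference idea can be anchored to the whole input. This is only a sketch, and it assumes the entire string is one unit repeated two or more times with nothing else around it; `repeated_units` is a hypothetical helper name, not something from the thread.

```ruby
# Returns the repeated unit once per occurrence, or [] when the text is
# not made up entirely of one repeated unit.
def repeated_units(text)
  # (.+?) lazily finds the shortest candidate unit; \1+ requires at least
  # one more copy; \A and \z force the copies to cover the whole string.
  match = text.match(/\A(.+?)\1+\z/m)
  return [] unless match

  unit = match[1]
  [unit] * (text.length / unit.length)
end

puts repeated_units("abcdefzfabcdefzfabcdefzf").inspect
```

Because of the backtracking involved, this is probably best kept to short inputs.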

Cheers

robert
--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can
- without end}
http://blog.rubybestpractices.com/

Punit Jain
2018-07-18 16:13:27 UTC
Here is the actual use case, with input data:

*Director Identification: AB-1A*

Director Type : FiberChannel
Director Status : Online
Director Slot No : 4

Director Port: 5
WWN Port Name :331123G56
Director Port Status :PendOn
SCSI Flags
{
Sequence(SEQ) :Disabled
SCSI_Support1(OS2007) :Enabled
}

Director Port: 7
WWN Port Name :3323H66
Director Port Status :PendOn
SCSI Flags
{
Sequence(SEQ) :Disabled
SCSI_Support1(OS2007) :Enabled
}

*Director Identification: AB-1B*

Director Type : FiberChannel
Director Status : Online
Director Slot No : 6

Director Port: 33
WWN Port Name :331123G56
Director Port Status :PendOn
SCSI Flags
{
Sequence(SEQ) :Disabled
SCSI_Support1(OS2007) :Enabled
}

I need to extract each Director Identification together with its Director
Ports; there can be one or many ports per identification.

Regards
Punit
Hassan Schroeder
2018-07-18 16:18:18 UTC
Post by Punit Jain
Here is the actual usecase with input data
LOL, this doesn't look much like your original question, but...
Post by Punit Jain
Need to extract Director Identification with respective Director Port which
can be 1 or many per identification.
What *exactly* does the output look like?

Just e.g. "Director Identification: AB-1B Director Port: 33" or more?
Hassan Schroeder
2018-07-18 16:48:18 UTC
Post by Hassan Schroeder
What *exactly* does the output look like?
Just e.g. "Director Identification: AB-1B Director Port: 33" or more?
I would also ask if that indentation is consistent, so something like
this would work:

2.5.1 (main):0 > output.scan /\s{,2}(\w+ \w+):\s*([^\s]+)/
=> [
[0] [
[0] "Director Identification",
[1] "AB-1A"
],
[1] [
[0] "Director Port",
[1] "5"
],
[2] [
[0] "Director Port",
[1] "7"
],
[3] [
[0] "Director Identification",
[1] "AB-1B"
],
[4] [
[0] "Director Port",
[1] "33"
]
]
2.5.1 (main):0 >
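
The flat list of pairs can then be grouped so that each identification carries its own ports. A minimal sketch, assuming the pairs arrive in document order and the first pair is a "Director Identification"; `group_ports` is a hypothetical helper name.

```ruby
# Groups flat [key, value] pairs into { identification => [ports] }.
def group_ports(pairs)
  pairs
    .slice_before { |pair| pair.first == "Director Identification" }
    .map { |chunk| [chunk.first.last, chunk.drop(1).map(&:last)] }
    .to_h
end

# The pairs below are copied from the scan result shown above.
pairs = [
  ["Director Identification", "AB-1A"],
  ["Director Port", "5"],
  ["Director Port", "7"],
  ["Director Identification", "AB-1B"],
  ["Director Port", "33"]
]

puts group_ports(pairs).inspect
```

slice_before starts a new chunk at every identification pair, so the ports that follow stay attached to the right director.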
Punit Jain
2018-07-18 16:55:10 UTC
Expected output:

"Director Identification":"AB-1A","Director Type":"FiberChannel","Director
Status":"Online","Director Port":{ "Port":5,"WWN Port
Name":"331123G56","SCSI Flags":{"Sequence(SEQ)":"Disabled"},"Director
Port":{ "Port":7,"WWN Port Name":"3323H66","SCSI Flags":{"Sequence(SEQ)":"
Disabled"}


"Director Identification":"AB-1B","Director Type":"FiberChannel","Director
Status":"Online","Director Port":{ "Port":33,"WWN Port Name":"331123G56","SCSI
Flags":{"Sequence(SEQ)":"Disabled"}


Regards,

Punit



Saverio M.
2018-07-18 17:07:27 UTC
Punit Jain
2018-07-18 17:21:10 UTC
You are right, Saverio, this is to be converted to JSON. I initially tried the
same, but got a parse error:

System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/json/common.rb:155:in
`parse': 757: unexpected token at 'Director Identification: AB-1A
(JSON::ParserError)

I think the input will require a lot of sanitization and conversion to the
required format. That's why I planned to go down the route of using a regex
with the scan method; however, I am having trouble finding the right regex.

Regards,
Punit
Post by Saverio M.
Hello Punit,
```ruby
require "json"
parsed_output = JSON.parse(output_string)
# [...]
```
With this you can easily manage `parsed_output`, which is a Hash.
You didn't copy/paste the text correctly, regardless of it being JSON or
not. There are 4 opening braces and 2 closing ones.
Z
Hassan Schroeder
2018-07-18 17:34:17 UTC
Post by Punit Jain
System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/json/common.rb:155:in
`parse': 757: unexpected token at 'Director Identification: AB-1A
(JSON::ParserError)
The input you showed is not remotely valid JSON.
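
For reference, JSON.parse raises JSON::ParserError on plain "key: value" text like this, which is the error quoted above. A small sketch of checking for that up front; `parse_json_or_nil` is a hypothetical helper name.

```ruby
require "json"

# Returns the parsed document, or nil when the text is not valid JSON.
def parse_json_or_nil(text)
  JSON.parse(text)
rescue JSON::ParserError
  nil
end

puts parse_json_or_nil("Director Identification: AB-1A").inspect
puts parse_json_or_nil('{"Director Port": 5}').inspect
```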
Post by Punit Jain
I think input will require lot of sanitization and converting to required
format. Thats why I planned to go down the route of using regex with scan
method, however facing problem in parsing with right regex.
Trying to create a single regex to parse this seems like a horrible
idea to me; time-consuming and bound to be brittle.

I would parse out the individual lines into `key: value` pairs and build
an object from that. Then write your formatter to take that object as
input and output the JSON you want.
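
A rough sketch of that line-by-line approach, under several assumptions: the layout matches the sample posted earlier, the * characters around the headers are only markup, and the ports of a director are collected in a "Director Ports" array (repeating the "Director Port" key inside one JSON object, as in the expected output, would not survive parsing). `parse_directors` is a hypothetical name.

```ruby
require "json"

# Turns the command output into an array of hashes, one per
# "Director Identification", each holding a "Director Ports" array.
def parse_directors(text)
  directors = []
  port = nil   # port hash currently being filled, if any
  flags = nil  # nested hash while inside a { ... } block

  text.each_line do |raw|
    line = raw.delete("*").strip
    next if line.empty? || line == "{"

    if line == "}"
      flags = nil
      next
    end

    key, value = line.split(":", 2).map(&:strip)

    if key == "Director Identification"
      directors << { key => value, "Director Ports" => [] }
      port = nil
      flags = nil
      next
    end

    next if directors.empty?  # ignore anything before the first header

    if key == "Director Port"
      port = { "Port" => (value =~ /\A\d+\z/ ? value.to_i : value) }
      directors.last["Director Ports"] << port
      flags = nil
    elsif value.nil?          # a bare label such as "SCSI Flags"
      flags = (port || directors.last)[key] = {}
    else
      (flags || port || directors.last)[key] = value
    end
  end

  directors
end

sample = [
  "*Director Identification: AB-1A*",
  "Director Type : FiberChannel",
  "Director Port: 5",
  "WWN Port Name :331123G56",
  "SCSI Flags",
  "{",
  "Sequence(SEQ) :Disabled",
  "}"
].join("\n")

puts JSON.pretty_generate(parse_directors(sample))
```

Keeping the parsing separate from the JSON generation means the output shape can be changed later without touching the line handling.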
Lee Roberts
2018-07-18 15:28:56 UTC
I don't think a lookahead is necessary if your example matches the
real-world scenario.
A scan with a regex of the string you are looking for should return all the
matches.
Here's an example from the pry REPL:

[5] pry(main)> reggie = /abcdefz/
=> /abcdefz/
[6] pry(main)> stringie.scan(reggie)
=> []
[7] pry(main)> stringie = "abcdefz" * 20
=>
"abcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefz"
[8] pry(main)> stringie.scan(reggie)
=> ["abcdefz",
"abcdefz",
"abcdefz",
"abcdefz",
...
"abcdefz"]
[9] pry(main)> stringie.scan(reggie).count
=> 20

I hope this is helpful and that I understood the question properly.

Punit Jain
2018-07-18 16:01:52 UTC
Hi Hassan,

I have a string variable containing the output of an EMC storage remote
command. The output looks like this:

*Director Identification: AB-1A*

Director Type : FiberChannel
Director Status : Online
Director Slot No : 4

Director Port: 5
WWN Port Name :331123G56
Director Port Status :PendOn
SCSI Flags
{
Sequence(SEQ) :Disabled
SCSI_Support1(OS2007) :Enabled
}

Director Port: 7
WWN Port Name :3323H66
Director Port Status :PendOn
SCSI Flags
{
Sequence(SEQ) :Disabled
SCSI_Support1(OS2007) :Enabled
}

*Director Identification: AB-1B*

Director Type : FiberChannel
Director Status : Online
Director Slot No : 6

Director Port: 33
WWN Port Name :331123G56
Director Port Status :PendOn
SCSI Flags
{
Sequence(SEQ) :Disabled
SCSI_Support1(OS2007) :Enabled
}


As you can see, there are two loops:

1. An outer *Director Identification* loop
2. Within each outer *Director Identification*, an inner Director Port: loop

I need to extract each outer block and, for each one, its inner blocks to
process. Here is what I am doing:

cmdoutput_nonewline = cmdoutput.gsub("\n", '|')
directorids = cmdoutput_nonewline.scan(/(?=(Director Identification.*?\|))/)
puts "#{directorids.size}"
directorids.each do |directorid|
  puts directorid
end

This does not give the required output; instead it prints:

Director Identification: AB-1A|

Director Identification: AB-1B|
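
The scan above returns only the header lines because .*? is lazy and stops at the first "|" after each "Director Identification". One possible alternative, sketched here, is to split the original text (newlines intact) at each header with a lookahead, then split each block again at each "Director Port:" line. It assumes the lines carry no leading indentation, as in the sample, and the helper names are hypothetical.

```ruby
# One string per "Director Identification" section.
def director_blocks(text)
  text.split(/^(?=\*?Director Identification)/)
      .select { |block| block.start_with?("Director Identification", "*Director Identification") }
end

# One string per "Director Port:" section inside a director block.
# "Director Port Status" lines do not match because of the colon.
def port_blocks(block)
  block.split(/^(?=Director Port:)/)
       .select { |part| part.start_with?("Director Port:") }
end

output = [
  "*Director Identification: AB-1A*",
  "Director Type : FiberChannel",
  "Director Port: 5",
  "Director Port Status :PendOn",
  "Director Port: 7",
  "Director Port Status :PendOn",
  "*Director Identification: AB-1B*",
  "Director Port: 33"
].join("\n")

director_blocks(output).each do |block|
  puts block.lines.first
  puts port_blocks(block).size
end
```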


Regards,
Punit


