https://www.sw.it.aoyama.ac.jp/2021/pub/RubyKaigi/
duerst@it.aoyama.ac.jp, Aoyama Gakuin University
© 2021 Martin J. Dürst, Aoyama Gakuin University
Many Ruby programmers use regular expressions frequently. They are an amazingly powerful tool for many different kinds of text processing. However, if not used carefully, they can also be dangerous: They may not exactly match what their writer thinks they match, and they may execute very slowly on certain inputs. This talk will help you understand regular expressions better, so that you can make good use of their amazing power while avoiding their dangerous sides. It will also discuss recent changes to Ruby in the area of regular expressions.
These slides have been created in HMTL, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.
Regular Expressions can be Amazingly convenient but also extremely Dangerous
Feature #17837: Add support for Regexp timeouts (by Sam Saffron)
How to detect/prevent dangerous regular expressions
Dangerous: Very slow (on some inputs)
Current ideas:
Regexp.timeout
)Rexexp.backtrack_limit=
,
Regexp.last_backtrack_count
)[1] On the Impact and Defeat of Regular Expression Denial of Service, James C. Davis, PhD thesis, Virginia Politechnic, 2020.
Code is mostly green, monospace
puts 'Hello Ruby!'
Variable parts are orange
puts "some string"
Results are indicated with "⇒"
1 + 1
⇒ 2
Mainly in the following areas:
String#encode
, Ruby 1.9)String#unicode-normalize
, Ruby
2.2)String#upcase
,..., Ruby 2.4)/Perl|Python|Pascal/
String#unicode_normalize
)"1"*63 =~ /^(11+?)\1*$/; puts $1.size
⇒
7
)[1] Fun with Regular Expressions: An Implementation of the Unicode Bidi Algorithm, Martin J. Dürst, 26th Internationalization & Unicode Conference, San Jose, USA, Sept. 2004
/^(11+?)\1*$/
How does it work?
/^___$/
: match from start to end, not just a part
1+
: one or more 1
s
11+
: two or more 1
s
11+?
: two or more 1
s, lazy (prefer as few as
possible)
(___)
: remember what you matched (as \1
and $1
)
\1*
: what you remembered, 0 or more times
"1"*9 =~ /^(11+?)\1*$/
: Smallest prime factor of 9
First try: (11) 11 11 11 1
⇒ fail
Second try: (111) 111 111
⇒ succeed, 3
"1"*63 =~ /^(11+?)\1*$/; puts $1.size
⇒
7
, 0.054s
"1"*11939 =~ /^(11+?)\1*$/; puts $1.size
⇒
11939
, 0.078s
"1"*99371 =~ /^(11+?)\1*$/; puts $1.size
⇒
99371
, 2.698s
"1"*193939 =~ /^(11+?)\1*$/; puts $1.size
⇒
193939
, 10.930s
⇒ Regular expressions are dangerous
(prime numbers from Wikipedia)
"Some people, when confronted with a problem, think
'I know, I'll use regular expressions.'
Now they have two problems."
- Jamie Zawiski
[1] The 'mailto' URI Scheme, M. Dürst, L. Masinter, and J. Zawinski, RFC 6068, October 2010.
+
: more than one; ?
: zero or one;
+?
: more than one, lazy)⇒ medium writability, very low readability
irb> 'a'.unicode_normalize :nfc
irb> p UnicodeNormalize::REGEXP_C
/[̀́̓̈́ʹ;·क़-य़ড়ঢ়য়ਲ਼ਸ਼ਖ਼-ਜ਼ਫ਼ଡ଼ଢ଼གྷཌྷདྷབྷཛྷཀྵཱཱིུྲྀླཱྀྀྒྷྜྷྡྷྦྷྫྷྐྵάέήίόύώΆιΈΉΐΊΰΎ΅`ΌΏ´ ΩKÅ〈〉⫝̸豈-嗀塚晴凞-羽蘒諸逸都飯-舘並-龎יִײַשׁ-זּטּ-לּמּנּסּףּפּצּ-פֿ텞-텤톻-퇀-精][̀-͎͐-ͯ҃-֑҇-ׇֽֿׁׂׅׄؐ-ًؚ-ٰٟۖ-ۜ۟-۪ۤۧۨ-ܑۭܰ-݊߫-߽߳ࠖ-࠙ࠛ-ࠣࠥ-ࠧࠩ-࡙࠭-࡛࣓-ࣣ࣡-़्ࣿ॑-়॔া্ৗ਼઼଼੍્৾ା୍ୖୗா்ௗ಼్ౕౖೂ್ೕೖ഻഼ാ്ൗ්ාෟุ-ฺ่-๋ຸ-຺່-໋ཱིེུ༹༘༙༵༷-ཽྀྂ-྄࿆྆྇ီ့္်ႍ፝-᜔᜴្᤹ᢩ፟៝-᩠᤻ᨘᨗ᩵-᩿᩼᪰-᬴᪽ᬵ᭄᭫-᯦᰷᮪᮫᯲᯳᭳᳐-᳔᳒-᳢᳠-᳨᳭᳴᳸᳹᷀-᷹᷻-᷿⃐-⃥⃜⃡-⃰⳯-⵿⳱ⷠ-〪ⷿ-゙゚〯꙯ꙴ-꠆꣄꙽ꚞꚟ꛰꛱꣠-꤫꣱-꦳꥓꧀꤭ꪰꪲ-꫶꯭ﬞꪴꪷꪸꪾ꪿꫁︠-︯ǽˠͶ-ͺਏਸ-ਿ૦ത-ധཆ-ཐ၆ၿႹႺᄀ-ᄂᄧᄳᄴᅳᇀᇊስሶዩዪጻጼጾፍፗ፦-፬፰-፴ᑂᑆᑞᒰᒺᒽᓂᓃᖯᖿᗀᘿᚶᚷᜫᠹᠺ᧠ᨴᩇ᪙᰿ᵂᵄᵅᶗ櫰-櫴欰-欶벞텥-텩텭-텲텻-톂톅-톋톪-톭퉂-퉄--------]*|[<->A-PR-Za-pr-z¨À-ÏÑ-ÖØ-Ýà-ïñ-öø-ýÿ-ďĒ-ĥĨ-İĴ-ķĹ-ľŃ-ňŌ-őŔ-ťŨ-ſƠơƯưƷǍ-ǜǞ-ǣǦ-ǰǴǵǸ-țȞȟȦ-ȳʒ΅ΆΈ-ΊΌΎ-ΑΕΗῚ?ΡΥΩ-αεηιορυω-ώϒ-ϔЀЁЃІЇЌ-ЎАГЕ-КОУЧЫЭаге-коучыэѐёѓіїќ-ўѴ-ѷӁӂӐ-ӓӖ-ӟӢ-ӵӸӹآ-اويۀ-ۂےۓەनऩरऱळऴেোৌେୈୋୌஒஔெேொ-ௌెైಿೀೆ-ೈೊೋെേൊ-ൌෙේො-ෞဥဦᬅ-ᬎᬑᬒᬺ-ᭃḀ-ẙẛẠ-ỹἀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ὰὲὴὶὸὺὼᾀ-ᾴᾶ-Ὰᾼ᾿῁-ῄῆ-ῈῊῌ-ῒῖ-Ὶ῝-ῢῤ-ῪῬ῭ῲ-ῴῶ-ῸῺῼ῾←→↔↚↛↮⇍-⇐⇒⇔∃∄∈∉∋∌∣-∦∼≁≃-≅≇-≉≍≠-≢≤≥≭-≽⊀-⊉⊑⊒⊢⊨⊩⊫-⊯⊲-⊵⋠-⋣⋪-⋭うか-ぢつ-どは-ぽゔゝゞウカ-ヂツ-ドハ-ポワ-ヲヴヷ-ヺヽヾ႙-ႜႥႫᄮᄯᄱᄲፇፋፌᒹᒻᒼᒾᖸ-ᖻ]?[̀-͎͐-ͯ҃-֑҇-ׇֽֿׁׂׅׄؐ-ًؚ-ٰٟۖ-ۜ۟-۪ۤۧۨ-ܑۭܰ-݊߫-߽߳ࠖ-࠙ࠛ-ࠣࠥ-ࠧࠩ-࡙࠭-࡛࣓-ࣣ࣡-़्ࣿ॑-়॔া্ৗ਼઼଼੍્৾ା୍ୖୗா்ௗ಼్ౕౖೂ್ೕೖ഻഼ാ്ൗ්ාෟุ-ฺ่-๋ຸ-຺່-໋ཱིེུ༹༘༙༵༷-ཽྀྂ-྄࿆྆྇ီ့္်ႍ፝-᜔᜴្᤹ᢩ፟៝-᩠᤻ᨘᨗ᩵-᩿᩼᪰-᬴᪽ᬵ᭄᭫-᯦᰷᮪᮫᯲᯳᭳᳐-᳔᳒-᳢᳠-᳨᳭᳴᳸᳹᷀-᷹᷻-᷿⃐-⃥⃜⃡-⃰⳯-⵿⳱ⷠ-〪ⷿ-゙゚〯꙯ꙴ-꠆꣄꙽ꚞꚟ꛰꛱꣠-꤫꣱-꦳꥓꧀꤭ꪰꪲ-꫶꯭ﬞꪴꪷꪸꪾ꪿꫁︠-︯ǽˠͶ-ͺਏਸ-ਿ૦ത-ധཆ-ເ?၆ၿႹႺᄀ-ᄂᄧᄳᄴᅳᇀᇊስሶዩዪጻጼጾፍፗ፦-፬፰-፴ᑂᑆᑞᒰᒺᒽᓂᓃᖯᖿᗀᘿᚶᚷᜫᠹᠺ᧠ᨴᩇ᪙᰿ᵂᵄᵅᶗ櫰-櫴欰-欶벞텥-텩텭-텲텻-톂톅-톋톪-톭퉂-퉄--------]+|[가개갸걔거게겨계고과괘괴교구궈궤귀규그긔기까깨꺄꺠꺼께껴꼐꼬꽈꽤꾀꾜꾸꿔꿰뀌뀨끄끠끼나내냐냬너네녀녜노놔놰뇌뇨누눠눼뉘뉴느늬니다대댜댸더데뎌뎨도돠돼되됴두둬뒈뒤듀드듸디따때땨떄떠떼뗘뗴또똬뙈뙤뚀뚜뚸뛔뛰뜌뜨띄띠라래랴럐러레려례로롸뢔뢰료루뤄뤠뤼류르릐리마매먀먜머메며몌모뫄뫠뫼묘무뭐뭬뮈뮤므믜미바배뱌뱨버베벼볘보봐봬뵈뵤부붜붸뷔뷰브븨비빠빼뺘뺴뻐뻬뼈뼤뽀뽜뽸뾔뾰뿌뿨쀄쀠쀼쁘쁴삐사새샤섀서세셔셰소솨쇄쇠쇼수숴쉐쉬슈스싀시싸쌔쌰썌써쎄쎠쎼쏘쏴쐐쐬쑈쑤쒀쒜쒸쓔쓰씌씨아애야얘어에여예오와왜외요우워웨위유으의이자재쟈쟤저제져졔조좌좨죄죠주줘줴쥐쥬즈즤지짜째쨔쨰쩌쩨쪄쪠쪼쫘쫴쬐쬬쭈쭤쮀쮜쮸쯔쯰찌차채챠챼처체쳐쳬초촤쵀최쵸추춰췌취츄츠츼치카캐캬컈커케켜켸코콰쾌쾨쿄쿠쿼퀘퀴큐크킈키타태탸턔터테텨톄토톼퇘퇴툐투퉈퉤튀튜트틔티파패퍄퍠퍼페펴폐포퐈퐤푀표푸풔풰퓌퓨프픠피하해햐햬허헤혀혜호화홰회효후훠훼휘휴흐희히][ᆨ-ᇂ]|[ᄀ-ᄒ][ᅡ-ᅵ][ᆨ-ᇂ]?/x
data/production-regexes/uniq-regexes-8.json
used,rubygems
[1] Why
Aren’t Regular Expressions a Lingua Franca? An Empirical Study on the Re-use
and Portability of Regular Expressions,
J.C. Davis, L.G. Michael IV, C.A Coghlan, F. Servant, and D. Lee,
ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European
Software Engineering Conference and Symposium on the Foundations of Software
Engineering, August 2019, pp. 443–454
/"([^"]*(&[ ]*[\n\r]+)?)*"/
/"([^"]*(&[ ]*[\n\r]+)?)*"/
/"( [^"]* ( & [ ]* [\n\r]+ )? )* /x
/"([^"]*(&[ ]*[\n\r]+)?)*"/
"
】, pump: 【&\n
】, suffix:
【】prefix + pump*n_pumps + suffix
/"([^"]*(&[ ]*[\n\r]+)?)*"/
"
】, pump: 【&\n
】, suffix:
【】pumps | attack vector | time [s] |
---|---|---|
1 | "&\n |
0.00096 |
2 | "&\n&\n |
0.00617 |
3 | "&\n&\n&\n |
1.21982 |
4 | "&\n&\n&\n&\n |
timeout (>10) |
(()*)*
)*+
,
++
, +*
,...)rubygems
Regexp.timeout
?
Regexp.backtrack_limit
?
My opinion:
Fixing ReDoS problem would make Ruby a "better citizen"
Basics available for evaluation of solutions, but more work needed
/["']([^"']+)["'].*already exists/
/"[^"]+".*end/
or /a[^a]+a.*e/
"bb"b"b"b"b"b"b"b"b"b"b"b"b"b"b"b"b"b"b...
pumps | time [s] |
---|---|
4096 | 0.054 |
40960 | 5.204 |
"b"
combinationalready
exists
.*
creates dangerous overlaps //x
to make regular expression structure visiblebacktrack_limit
implementationSend questions and comments to Martin Dürst
(mailto:duerst@it.aoyama.ac.jp)
or open/contribute to a bug report or
feature request
The latest version of this presentation is available at: