Regex in Java with concrete examples, Part 2

Table of contents

1 Logical Operators
- 1.1 XY
- 1.2 X|Y
2 (X)
3 Back references
4 Special constructs
5 Modifier Flags
- 5.1 Case Insensitive Mode
- 5.2 Multi-line Mode

In part 1, we learned about the Pattern class, the Matcher class and their methods. We also saw how to write a regex that checks an input string using built-in constructs such as boundary matchers and quantifiers.

In this part we will look at the remaining built-in regex constructs that Java provides.

Logical Operators

Expression	Description
XY	Matches X followed by Y
X|Y	Matches either X or Y
(X)	Capturing group

XY

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        /*
         * The Logical Operator XY matches the X followed by Y.
         */
        Pattern pattern = Pattern.compile("to");
        Matcher matcher = pattern.matcher("Welcome to https://shareprogramming.net//");

        while (matcher.find()) {
            System.out.println(matcher.group() + ", Match String start(): "
                    + matcher.start());
        }
    }
}

Output: to, Match String start(): 8

X|Y

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        /*
         * The Logical Operator X|Y matches the X or Y.
         */
        Pattern pattern = Pattern.compile("t|o");
        Matcher matcher = pattern.matcher("Welcome to https://shareprogramming.net//");

        while (matcher.find()) {
            System.out.println(matcher.group() + ", Match String start(): "
                    + matcher.start());
        }
    }
}

Output:
o, Match String start(): 4
t, Match String start(): 8
o, Match String start(): 9
t, Match String start(): 12
t, Match String start(): 13
o, Match String start(): 26
t, Match String start(): 38
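
One detail worth illustrating: alternation binds more loosely than concatenation, so parentheses are needed to limit what | applies to. The following is a minimal sketch; the helper name fullMatch and the sample words are made up for this illustration.

```java
import java.util.regex.Pattern;

public class Main {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "gra|ey" means "gra" OR "ey", so the whole word "grey" does not match.
        System.out.println(fullMatch("gra|ey", "grey"));   // false
        // Parentheses restrict the alternation to a single letter.
        System.out.println(fullMatch("gr(a|e)y", "grey")); // true
        System.out.println(fullMatch("gr(a|e)y", "gray")); // true
    }
}
```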

(X)

A capturing group treats several characters as a single unit. Each group is enclosed in a pair of parentheses.

Groups are numbered by counting their opening parentheses from left to right. For example, ((A)(B(C))) contains 4 groups:

((A)(B(C)))
(A)
(B(C))
(C)

To find out how many capturing groups a pattern has, call groupCount() on the Matcher object. Group 0 always stands for the whole match and is not included in that count, so groupCount() returns 4 for ((A)(B(C))).
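
As a small sketch of the numbering rule, the class below runs ((A)(B(C))) against the input "ABC" and lists every group. The helper name groups is an assumption made for this illustration, not part of the Java API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Collects group 0..groupCount() of the first match, or an empty list if there is none.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("((A)(B(C)))").matcher("ABC");
        System.out.println(m.groupCount());               // 4 (group 0 is not counted)
        System.out.println(groups("((A)(B(C)))", "ABC")); // [ABC, ABC, A, BC, C]
    }
}
```

Index 0 of the printed list is the whole match; indexes 1 to 4 follow the order of the opening parentheses.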

// file Main.java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String args[]) {
        String line = "shareprogramming.net start2019! ok!";
        String pattern = "(.*?)(\\d+)(.*)";

        Pattern r = Pattern.compile(pattern);

        Matcher m = r.matcher(line);

        System.out.println("Group: " + m.groupCount());

        if (m.find()) {
            System.out.println("Found value: " + m.group(1));
            System.out.println("Found value: " + m.group(2));
            System.out.println("Found value: " + m.group(3));
        } else {
            System.out.println("NO MATCH");
        }
    }
}

Output:

Group: 3
Found value: shareprogramming.net start
Found value: 2019
Found value: ! ok!

Note:

Group 1 – .*?: a reluctant quantifier, so it matches as few characters as possible and stops where group 2 can start matching (. matches any character).

Group 2 – \\d+: matches one or more digits. It is a greedy quantifier, so it takes as many digits as possible before group 3 begins.

Group 3 – .*: matches any characters, so it takes what is left: ! ok!
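
To see why the reluctant quantifier matters, here is a hedged sketch that runs the same input through a variant whose first group is greedy. The helper name groups is made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Returns groups 1..groupCount() of the first match, or an empty list if there is none.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "shareprogramming.net start2019! ok!";
        // Reluctant first group: stops as early as possible, so \d+ gets all four digits.
        System.out.println(groups("(.*?)(\\d+)(.*)", line));
        // prints [shareprogramming.net start, 2019, ! ok!]

        // Greedy first group: grabs as much as it can and leaves \d+ a single digit.
        System.out.println(groups("(.*)(\\d+)(.*)", line));
        // prints [shareprogramming.net start201, 9, ! ok!]
    }
}
```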

Back references

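Back references let us reuse what an earlier group has matched. We refer to a previous group with the syntax \n, where n is the group number (for example \1 for group 1). A back reference matches the same text the group captured, not just the same pattern.

Example 1: Checking for a repeated sequence of digits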

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String args[]) {
        String str = "123123";
        Pattern p = Pattern.compile("(\\d\\d\\d)\\1");
        Matcher m = p.matcher(str);
        System.out.println(m.groupCount());
        while (m.find()) {
            String word = m.group();
            System.out.println(word + " " + m.start() + " " + m.end());
        }
    }
}

Output:

1
123123 0 6
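
Another common use of back references is spotting a word typed twice in a row. The sketch below is only an illustration; the helper name repeatedWords and the sample sentence are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Returns every word that is immediately followed by itself.
    static List<String> repeatedWords(String input) {
        // (\w+) captures a word, \s+ skips whitespace, \1 demands the same text again.
        Matcher m = Pattern.compile("\\b(\\w+)\\s+\\1\\b").matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(repeatedWords("this is is a test test case")); // [is, test]
    }
}
```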

Special constructs

The following special constructs are available:

(?:X) – non-capturing group
(?=X) – positive lookahead
(?!X) – negative lookahead
(?<=X) – positive lookbehind
(?<!X) – negative lookbehind
(?>X) – independent, non-capturing group

(?:X) – non-capturing group

The expression (?:X) groups X without capturing it, so the group is left out of the captured results.

Example:

Input: https://shareprogramming.net//

Apply the regex: (https?|http)://([^/\r\n]+)(/[^\r\n]*)?

Output:

https://shareprogramming.net//
https
shareprogramming.net
//

I do not care about the protocol (https, http, etc.), so I want to drop it from the results.

To do that, I use the (?:X) construct.

The expression is rewritten as follows:

(?:https?|http)://([^/\r\n]+)(/[^\r\n]*)?

Output:

https://shareprogramming.net//
shareprogramming.net
//

Reference code

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String args[]) {
        String str = "https://shareprogramming.net//";
        // \\r and \\n pass the regex escapes \r and \n through to the regex engine
        Pattern p = Pattern.compile("(?:https?|http)://([^/\\r\\n]+)(/[^\\r\\n]*)?");
        Matcher m = p.matcher(str);
        while (m.find()) {
            String word = m.group();
            System.out.println(word);
            System.out.println(m.group(1));
            System.out.println(m.group(2));

        }
    }
}

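(?=X) – positive lookahead

Suppose we want to know whether the input contains "incident", wherever it occurs. A lookahead (?=X) asserts that X matches starting at the current position, without consuming any characters. In the pattern below the leading (.*) can stop at any position, so the match succeeds as long as some position is directly followed by "incident".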

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String args[]) {
        String str = "There was a crime incident and other incident";
        Pattern p = Pattern.compile("(.*)(?=incident)(.*)");
        Matcher m = p.matcher(str);
        System.out.println(m.matches());
    }
}

Output: true
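
Because a lookahead consumes nothing, the text it checks is not part of the match. As a rough sketch (the helper name findAll and the sample input are chosen for this illustration), the pattern below finds words that are followed by "!" without including the "!" itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Returns every match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // \w+ must be directly followed by '!', but '!' stays outside the match.
        System.out.println(findAll("\\w+(?=!)", "start2019! ok!")); // [start2019, ok]
    }
}
```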

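(?!X) – negative lookahead

The opposite of positive lookahead: (?!X) succeeds only when X does not match at the current position.

For example, you want to find the word "incident" but do not want "theft" to appear. Placing (?!.*theft) at the start of the pattern rejects any input that contains "theft".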

import java.util.regex.Pattern;

public class Main {
    public static void main(String args[]) {
        String regex = "(?!.*theft).*incident.*";
        System.out.println("Case 1: " + Pattern.matches(regex, "The incident involved a theft"));
        System.out.println("Case 2: " + Pattern.matches(regex, "There was a crime incident and other incident"));
    }
}

Output:

Case 1: false
Case 2: true
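
A negative lookahead can also be used with find() to test a single position. The sketch below is an illustration under assumed names (starts, the sample text): it locates every "q" that is not followed by "u".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Returns the start index of every match of the regex in the input.
    static List<Integer> starts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // The 'q' of "Iraq" (index 3) matches; the 'q' of "quit" is followed by 'u'.
        System.out.println(starts("q(?!u)", "Iraq quit")); // [3]
    }
}
```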

(?<=X) – positive lookbehind

Lookbehind works much like the lookahead from the previous section, but it looks backwards: it lets us match a pattern only when something specific comes right before it.

For example, I want to find USD amounts in the input string. Such amounts are usually written as ${number}.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String args[]) {
        String str = "1 turkey costs $30";
        // Match digits only when a dollar sign stands right before them.
        Pattern p = Pattern.compile("(?<=\\$)\\d+");
        Matcher m = p.matcher(str);
        // find() searches inside the input; matches() would require the whole
        // input to match and would print false here.
        while (m.find()) {
            System.out.println(str.substring(m.start(), m.end()));
        }
    }
}

Output: 30
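
For comparison, the same values can be extracted with a capturing group instead of a lookbehind. This is only a sketch; the helper name findGroup and the longer sample sentence are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Returns the given group of every match of the regex in the input.
    static List<String> findGroup(String regex, int group, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group(group));
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "1 turkey costs $30, 2 cost $55";
        // Lookbehind: '$' is required but is not part of the match (group 0).
        System.out.println(findGroup("(?<=\\$)\\d+", 0, str)); // [30, 55]
        // Capturing group: '$' is part of the match, the digits sit in group 1.
        System.out.println(findGroup("\\$(\\d+)", 1, str));    // [30, 55]
    }
}
```

The lookbehind version is handy when the whole match should be the number itself, for example when it is passed on to a replace operation.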

(?<!X) – negative lookbehind

The opposite of (?<=X): it lets us find a pattern that is not preceded by something.

Example: I have a sequence of numbers, but because of a glitch some of them were corrupted by a leading letter (a-z or A-Z). I want to extract only the uncorrupted numbers so that no more data gets damaged.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String args[]) {
        String str = "v3 a7 07 n4 59 23 ";
        // Match digits only when no letter stands right before them.
        Pattern p = Pattern.compile("(?<![a-zA-Z])\\d+");
        Matcher m = p.matcher(str);
        // find() searches inside the input; matches() would test the whole input.
        while (m.find()) {
            System.out.println(m.start() + " - " + m.end());
            System.out.println(str.substring(m.start(), m.end()));
        }
    }
}

Output:

6 - 8
07
12 - 14
59
15 - 17
23
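
The independent (atomic) group from the list of special constructs has no example of its own, so here is a minimal sketch of how it differs from a plain non-capturing group. Once an atomic group has matched, the engine does not go back into it to try another alternative. The helper name fullMatch and the sample strings are made up for this illustration.

```java
import java.util.regex.Pattern;

public class Main {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Non-capturing group: "bc" is tried first, the final c fails,
        // so the engine backtracks to the alternative "b" and succeeds.
        System.out.println(fullMatch("a(?:bc|b)c", "abc"));  // true
        // Atomic group: after "bc" matched, the alternative "b" is never retried.
        System.out.println(fullMatch("a(?>bc|b)c", "abc"));  // false
        System.out.println(fullMatch("a(?>bc|b)c", "abcc")); // true
    }
}
```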

Modifier Flags

Java regex provides a number of flags that modify how matching and searching behave.

Each flag can be turned on directly inside the regex string, or by passing the corresponding constant to Pattern.compile().
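
Several flags can also be combined. The sketch below shows both styles side by side: constants joined with the bitwise OR operator, and the inline form (?im). The helper name findAll and the sample text are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Returns every match of the compiled pattern in the input.
    static List<String> findAll(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "The First line\nthe second line";
        // Flags passed as constants, combined with |
        Pattern byConstants = Pattern.compile("^the.*$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        // The same two flags written inline: i = case-insensitive, m = multi-line
        Pattern byInline = Pattern.compile("(?im)^the.*$");
        System.out.println(findAll(byConstants, str)); // [The First line, the second line]
        System.out.println(findAll(byInline, str));    // [The First line, the second line]
    }
}
```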

Case Insensitive Mode

By default, Java regex matching is case sensitive, meaning it distinguishes uppercase from lowercase letters.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String args[]) {
        String str = "Stew Pasta Twinkies";

        // By default Java regex is case sensitive

        System.out.println("Default ");
        Matcher m0 = Pattern.compile("\\b[a-z]+\\b").matcher(str);
        while (m0.find()) {
            System.out.println(str.substring(m0.start(), m0.end()));
        }


        // Turn on case-insensitive matching with a flag constant
        System.out.println("\nCase Pattern.CASE_INSENSITIVE");
        Matcher m1 = Pattern.compile("\\b[a-z]+\\b", Pattern.CASE_INSENSITIVE).matcher(str);
        while (m1.find()) {
            System.out.println(str.substring(m1.start(), m1.end()));
        }


        // Turn on case-insensitive matching with the inline flag (?i)
        System.out.println("\nCase FLAG");
        Matcher m2 = Pattern.compile("(?i)\\b[a-z]+\\b").matcher(str);
        while (m2.find()) {
            System.out.println(str.substring(m2.start(), m2.end()));
        }

        // The inline case-insensitive flag can appear anywhere in the regex
        System.out.println("\nUse insensetive anywhere");
        Matcher m3 = Pattern.compile("[0-9]*(?i)\\b[a-z]+\\b").matcher(str);
        while (m3.find()) {
            System.out.println(str.substring(m3.start(), m3.end()));
        }

        // Turn off case-insensitive matching with the inline flag (?-i)
        System.out.println("\nCase disable insensitive use group");
        Matcher m4 = Pattern.compile("(?-i)\\b[a-z]+\\b").matcher(str);
        while (m4.find()) {
            System.out.println(str.substring(m4.start(), m4.end()));
        }
    }
}

Output:

Default

Case Pattern.CASE_INSENSITIVE
Stew
Pasta
Twinkies

Case FLAG
Stew
Pasta
Twinkies

Use insensetive anywhere
Stew
Pasta
Twinkies

Case disable insensitive use group

Multi-line Mode

When this mode is turned on, ^ and $ match at the start and end of every line. By default they match only at the start and end of the entire input.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String args[]) {
        String str = "The First line\nThe SecondLine";

        // Without $, the first line matches because ^ matches at the start of the input
        System.out.println("Default");
        Matcher m0 = Pattern.compile("^T.*e").matcher(str);
        while (m0.find()) {
            System.out.println(str.substring(m0.start(), m0.end()));
        }

        // Without MULTILINE, $ only matches at the end of the input, so nothing is found
        System.out.println("\nUsing ^ and $");
        Matcher m1 = Pattern.compile("^T.*e$").matcher(str);
        while (m1.find()) {
            System.out.println(str.substring(m1.start(), m1.end()));
        }

        // MULTILINE makes ^ and $ match at every line boundary
        System.out.println("\nUsing multi-line mode");
        Matcher m2 = Pattern.compile("^T.*e$", Pattern.MULTILINE).matcher(str);
        while (m2.find()) {
            System.out.println(str.substring(m2.start(), m2.end()));
        }

        // The same mode turned on with the inline flag (?m)
        Matcher m3 = Pattern.compile("(?m)^T.*e$").matcher(str);
        System.out.println("\nUsing multi-line mode with the inline flag");
        while (m3.find()) {
            System.out.println(str.substring(m3.start(), m3.end()));
        }
    }
}

Output:

Default
The First line

Using ^ and $

Using multi-line mode
The First line
The SecondLine

Using multi-line mode with the inline flag
The First line
The SecondLine

Besides these two flags, there are several others:

Dot-all (Pattern.DOTALL)
Comments and whitespace (Pattern.COMMENTS)
Unicode-aware case folding (Pattern.UNICODE_CASE)
Literal parsing (Pattern.LITERAL)
Unix lines (Pattern.UNIX_LINES)
Unicode canonical equivalence (Pattern.CANON_EQ)

There are quite a few modes. I have only given examples for the two flags that I think are used most often; the others are described in the Oracle docs.
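
As a brief, hedged taste of two of the remaining flags, the sketch below tries dot-all mode (the dot also matches line terminators) and literal parsing (metacharacters lose their special meaning). The helper name found and the sample inputs are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

public class Main {
    // Returns true when the regex, compiled with the given flags, is found in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String str = "The First line\nThe SecondLine";
        // By default '.' does not match the line break between the two lines.
        System.out.println(found("First.*Second", 0, str));              // false
        // With DOTALL, or the inline flag (?s), it does.
        System.out.println(found("First.*Second", Pattern.DOTALL, str)); // true
        System.out.println(found("(?s)First.*Second", 0, str));          // true

        // Without LITERAL, "1+1" means one or more '1' followed by '1'.
        System.out.println(found("1+1", 0, "1+1=2"));                    // false
        // With LITERAL, '+' is an ordinary character.
        System.out.println(found("1+1", Pattern.LITERAL, "1+1=2"));      // true
    }
}
```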
