02a-php_stringsx

Martin Kruliš
by Martin Kruliš (v1.0)
26. 2. 2015

One Charset to Rule Them All
◦ Use a single charset everywhere: HTML, PHP, database (and its connection), text files, …
◦ The choice is determined by the language(s) used
▪ Unicode covers almost every language
◦ Convert incoming data as early as possible and outgoing data as late as possible

Charset in Meta-data
◦ Must be in HTTP headers
header('Content-Type: text/html; charset=utf-8');
◦ Do not use the HTML meta element with http-equiv
▪ Except in special cases (such as saving an HTML file locally)

Multibyte Character Encoding
◦ Some charsets (e.g., UTF-8, UTF-16, …) use more than one byte per character
◦ Standard string functions are single-byte (ANSI) based
▪ They treat each byte as one character

Multibyte String Functions Library
◦ Standard extension (mbstring), often present in PHP
◦ Duplicates most of the standard string functions, but with the prefix mb_ (mb_strlen, mb_strpos, …)
◦ Encoding conversions with mb_convert_encoding()
◦ mb_internal_encoding() – gets or sets the internal encoding used by the mb_ functions
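The difference between byte-based and multibyte functions can be sketched as follows. This is a minimal illustration, not the lecture's own example; it assumes the mbstring extension is loaded and that the source file is saved as UTF-8.

```php
<?php
// Assumption: mbstring is loaded and this file is saved as UTF-8.
mb_internal_encoding('UTF-8');

$name = 'Kruliš';                     // 6 characters; "š" takes 2 bytes in UTF-8

$byteLength = strlen($name);          // counts bytes: 7
$charLength = mb_strlen($name);       // counts characters: 6

$upper = mb_strtoupper($name);        // 'KRULIŠ'

// Convert to a single-byte charset; ISO-8859-2 contains "š".
$latin2 = mb_convert_encoding($name, 'ISO-8859-2', 'UTF-8');
$latin2Length = strlen($latin2);      // 6 bytes, one per character
```

The byte count and the character count differ only because of the one non-ASCII letter, which is why such bugs often go unnoticed with English-only test data.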
Example 1

Encoding Input Data from HTTP
◦ Usually done transparently
▪ Check the “mbstring” section of php.ini
◦ Can be done manually with mb_parse_str()
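Manual parsing with mb_parse_str() might look like the sketch below. The query string is a made-up example, and the mbstring extension is assumed to be loaded.

```php
<?php
// Assumption: mbstring is loaded. The query string is hypothetical.
$query = 'name=Martin&tags[]=php&tags[]=utf8';

$params = [];
mb_parse_str($query, $params);   // parses the string into the $params array

// $params['name'] is 'Martin'
// $params['tags'] is ['php', 'utf8']
```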

Databases
◦ The database or the database connection usually needs to be configured
◦ An example for a MySQL database
▪ mysqli_set_charset()

Lexicographical Comparison of Strings
◦ Best done elsewhere (in the DBMS, for instance)
◦ The strcmp() function is binary safe; it compares bytes and ignores the locale
◦ For locale-aware comparison with strcoll(), the locale must be set correctly (setlocale())
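A short sketch of byte-wise versus locale-aware ordering. strcoll() is brought in here for illustration as the locale-aware counterpart of strcmp(); the locale name en_US.UTF-8 is an assumption and may not be installed on a given system.

```php
<?php
$words = ['Zebra', 'apple', 'Mango'];

// strcmp() compares raw bytes, so uppercase ASCII letters sort before lowercase ones.
$byBytes = $words;
usort($byBytes, 'strcmp');            // ['Mango', 'Zebra', 'apple']

// strcoll() follows the current LC_COLLATE locale.
// Assumption: the locale name below; setlocale() returns false if it is unavailable.
$locale = setlocale(LC_COLLATE, 'en_US.UTF-8');
$byLocale = $words;
usort($byLocale, 'strcoll');          // order depends on the locale actually in effect
```

Because the result of strcoll() depends on the installed locales, sorting in the DBMS is usually the more predictable option.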

Iconv Library
◦ An alternative to Multibyte String Functions
◦ Fewer functions
◦ Easier for encoding conversions
▪ Can deal with missing mappings by transliterating or dropping characters
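A minimal conversion sketch with iconv(), assuming the iconv extension is loaded and the file is saved as UTF-8. The //TRANSLIT suffix is shown without an expected result because its output depends on the underlying iconv implementation.

```php
<?php
// Assumption: the iconv extension is loaded and this file is saved as UTF-8.
$utf8 = 'Kruliš';

// ISO-8859-2 contains "š", so the conversion is lossless.
$latin2 = iconv('UTF-8', 'ISO-8859-2', $utf8);
$roundTrip = iconv('ISO-8859-2', 'UTF-8', $latin2);

// ASCII has no "š". //TRANSLIT asks iconv to approximate such characters,
// //IGNORE asks it to drop them. The exact result is implementation dependent.
$ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
```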

What to Sanitize
◦ Everything that possibly comes from users: $_GET, $_POST, $_COOKIE, …
◦ Data that comes from external sources (database, text files, …)

When to Sanitize
◦ On input
▪ At the beginning of the script
◦ On output
▪ When inserted into HTML, into SQL queries, …

How to Verify
◦ Regular expressions
◦ Filter functions
▪ filter_input(), filter_var(), …
▪ Useful for special validations (e-mail, URL, IP, …)
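The filter functions can be tried out with a few literal values. This is only a sketch; the sample inputs are invented, and in a real script the values would typically come from filter_input() instead.

```php
<?php
// Validation filters return the (possibly converted) value on success and false on failure.
$email    = filter_var('user@example.com', FILTER_VALIDATE_EMAIL);       // 'user@example.com'
$badEmail = filter_var('not an e-mail', FILTER_VALIDATE_EMAIL);          // false
$badIp    = filter_var('192.168.1.300', FILTER_VALIDATE_IP);             // false (300 > 255)
$number   = filter_var('42', FILTER_VALIDATE_INT);                       // int 42
$url      = filter_var('http://example.com/path', FILTER_VALIDATE_URL);  // the same string

// filter_input(INPUT_GET, 'page', FILTER_VALIDATE_INT) would validate $_GET['page']
// the same way ('page' is a hypothetical parameter name).
```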

How to Sanitize
◦ String and filter functions, regular expressions
◦ htmlspecialchars() – encoding for HTML
◦ urlencode() – encoding for URL
◦ DBMS-specific functions (mysqli_escape_string())
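Output sanitizing for HTML and URLs, as a brief sketch. The input strings and the search.php script name are made up for illustration.

```php
<?php
// Hypothetical user input that contains HTML meta-characters.
$comment = '<b>Tom & "Jerry"</b>';

// Encode for safe insertion into HTML (ENT_QUOTES also covers single quotes).
$html = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');
// '&lt;b&gt;Tom &amp; &quot;Jerry&quot;&lt;/b&gt;'

// Encode a value before placing it into a URL query string.
$term = 'fish & chips';
$link = 'search.php?q=' . urlencode($term);
// 'search.php?q=fish+%26+chips'
```

Each target context (HTML, URL, SQL) has its own escaping rules, which is why the encoding is applied at the point of output.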

String Search Patterns
◦ A special syntax that describes a (regular) language, i.e., a program for a finite automaton
◦ Simple to use
▪ The encoding is (mostly) human readable
◦ POSIX and Perl standards

Usage
◦ Searching strings, listing matches
◦ Find and replace
◦ Splitting a string into an array of strings

Expression
◦ <separator>expr<separator>modifiers
◦ The separator (delimiter) is a single non-alphanumeric character (usually /, #, %, …)
◦ Pattern modifiers are flags that affect the evaluation

Base Syntax
◦ Sequence of atoms
◦ An atom can be
▪ A simple (non-meta) character (letter, number, …)
▪ A dot (.), which represents any character (except a newline by default)
▪ A list of characters in [] ([abc], [0-9a-z_], …)
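The expression format and the basic atoms can be illustrated with preg_match(). This is a small sketch with invented subject strings.

```php
<?php
// The same pattern with two different separators.
// With '/', every slash inside the pattern has to be escaped; with '#', it does not.
$slash = preg_match('/^\/home\/[a-z0-9_]+$/', '/home/martin');   // 1
$hash  = preg_match('#^/home/[a-z0-9_]+$#', '/home/martin');     // 1

// The dot stands for exactly one character.
$oneChar  = preg_match('/^h.t$/', 'hot');    // 1
$twoChars = preg_match('/^h.t$/', 'hoot');   // 0

// A character list followed by a digit range; the i modifier ignores case.
$list = preg_match('/^[abc][0-9]$/i', 'B7'); // 1
```

Choosing a separator that does not occur in the pattern keeps the expression readable.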

Important Meta-characters
◦ \ – an escaping character for other meta-characters
◦ Anchors ^, $ – mark the start/end of a string/line
▪ ^ at the start of a character class definition inverts the set
◦ [ ] – character class definition
◦ { } – min/max quantifier: atom{n}, atom{min,max}
▪ [0-9]{8} (8-digit number), .{1,9} (1-9 chars)
◦ ( ) – subpattern (treated like an atom)
◦ *, +, ? – repetitions, shorthand notations for {0,}, {1,}, and {0,1} respectively
◦ | – branches (ptrn1|ptrn2)
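Each of the meta-characters above can be checked with a one-line test. The following sketch uses invented subject strings.

```php
<?php
$eightDigits = preg_match('/^[0-9]{8}$/', '20150226');   // 1: exactly 8 digits
$tooShort    = preg_match('/^[0-9]{8}$/', '2015');       // 0: only 4 digits
$noDigits    = preg_match('/^[^0-9]+$/', 'abc');         // 1: ^ inverts the class
$repeated    = preg_match('/^(ab)+$/', 'ababab');        // 1: subpattern repeated by +
$optional    = preg_match('/^colou?r$/', 'color');       // 1: ? makes "u" optional
$branch      = preg_match('/^(cat|dog)s?$/', 'dogs');    // 1: second branch matches
$escapedDot  = preg_match('/^3\.14$/', '3x14');          // 0: \. matches only a literal dot
```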

Character Classes
◦ Pre-defined classes identified by names [:name:]
▪ For example [ab[:digit:]] matches a, b, and 0-9
◦ alpha – letters
◦ digit – decimal digits
◦ alnum – letters and digits
◦ blank – horizontal whitespace (space and tab)
◦ space – any whitespace (including line breaks)
◦ lower, upper – lowercase/uppercase letters
◦ cntrl – control characters
◦ xdigit – hexadecimal digits
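A few of the named classes in use, as a hedged sketch with made-up inputs. Note that a named class is only valid inside a bracket expression, hence the doubled brackets.

```php
<?php
$mixed   = preg_match('/^[ab[:digit:]]+$/', 'a1b2');       // 1: only a, b, and digits
$other   = preg_match('/^[ab[:digit:]]+$/', 'a1c2');       // 0: "c" is not in the class
$hex     = preg_match('/^[[:xdigit:]]+$/', 'DEADbeef');    // 1: hexadecimal digits
$tabbed  = preg_match('/^[[:alpha:]]+[[:blank:]][[:upper:]]+$/', "php\tPCRE");  // 1

// Collapse any run of whitespace (including line breaks) into a single space.
$squeezed = preg_replace('/[[:space:]]+/', ' ', "a \n\t b");   // 'a b'
```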

Modifiers
◦ i – case insensitive
◦ m – multiline mode (^, $ match start/end of a line)
◦ s – '.' also matches a newline character
◦ x – ignore whitespace in the regex (except in character class constructs)
◦ S – more extensive performance optimizations
◦ U – switch to non-greedy evaluation
▪ Greedy evaluation means that patterns with *, +, or ? try to match as many characters as possible
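The effect of the individual modifiers can be observed on small inputs. This sketch uses invented strings; preg_match_all() is used once to count matches across lines.

```php
<?php
$insensitive = preg_match('/hello/i', 'HeLLo');                  // 1

// With m, ^ and $ match at every line boundary: three whole-line words.
$lines = preg_match_all('/^\w+$/m', "one\ntwo\nthree");          // 3

$dotDefault = preg_match('/a.b/', "a\nb");                       // 0: '.' skips newline
$dotAll     = preg_match('/a.b/s', "a\nb");                      // 1

// With x, the spaces in the pattern are ignored.
$extended = preg_match('/ \d{4} - \d{2} /x', '2015-02');         // 1

// Greedy by default: the match extends to the last '>'.
preg_match('/<.+>/', '<b>bold</b>', $greedy);                    // $greedy[0] is '<b>bold</b>'
// U inverts greediness: the match stops at the first '>'.
preg_match('/<.+>/U', '<b>bold</b>', $lazy);                     // $lazy[0] is '<b>'
```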

Subpatterns
◦ To ensure correct operator precedence
(one|two|three){1,3}
◦ To apply modifiers to only a part of the expression
(?modifiers:ptrn)
◦ To mark important parts of the expression
▪ Used to retrieve parts of a string after matching
▪ Named subpatterns
(?<name>ptrn), or (?'name'ptrn)
▪ Non-capturing subpatterns (not stored in the matches)
(?:ptrn)
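The three uses of subpatterns, sketched with invented inputs. The group names year and month are chosen only for illustration.

```php
<?php
// Precedence: the quantifier applies to the whole alternation.
$grouped = preg_match('/^(one|two|three){1,3}$/', 'onetwo');        // 1

// Local modifier: only "php" is matched case-insensitively.
$localOk  = preg_match('/(?i:php) rocks/', 'PHP rocks');            // 1
$localBad = preg_match('/(?i:php) rocks/', 'PHP ROCKS');            // 0

// Named subpatterns are available both by name and by number.
preg_match('/(?<year>\d{4})-(?<month>\d{2})/', 'Released 2015-02', $date);
// $date[0] is '2015-02', $date['year'] and $date[1] are '2015', $date['month'] is '02'

// A non-capturing group does not take an index, so the surname is at index 1.
preg_match('/(?:Mr|Ms)\. (\w+)/', 'Ms. Smith', $person);           // $person[1] is 'Smith'
```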

E-mail Verification (RFC 2822)
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/
=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21
\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*
")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]
(?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|
[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?
[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c
\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c
\x0e-\x7f])+)\])

preg_match($ptrn, $subj [,&$matches])
◦ Searches the given string using a regex
◦ Returns 1 if the pattern matches the subject, 0 if it does not, and false on error
◦ The matches array gathers the matched substrings of the subject with respect to the expression and its subpatterns
▪ Subpatterns are indexed from 1
▪ Index 0 holds the text matched by the entire expression
▪ Named subpatterns are indexed by their names (and also by their numbers)
"6 eggs, 3 spoons of oil, 250g of flour"
~
/[[:digit:]]+/
array(1) {
[0] => string("6")
}
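The recipe example can be reproduced in code as follows. This is a sketch; preg_match_all() and the named group amount are additions for illustration and are not part of the slide.

```php
<?php
$recipe = '6 eggs, 3 spoons of oil, 250g of flour';

// preg_match() stops at the first match.
$found = preg_match('/[[:digit:]]+/', $recipe, $first);
// $found is 1, $first is ['6']

// preg_match_all() collects every match; index 0 holds the full matches.
$count = preg_match_all('/[[:digit:]]+/', $recipe, $all);
// $count is 3, $all[0] is ['6', '3', '250']

// Subpatterns: a named group (index 1) and an unnamed group (index 2).
preg_match('/(?<amount>\d+)g of (\w+)/', $recipe, $m);
// $m['amount'] is '250', $m[2] is 'flour'
```

Since the return value can also be false on error, comparing it with === 1 is safer than a plain truth test.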

preg_replace($ptrn, $repl, $str)
◦ Searches for and replaces substrings in a string
▪ Each match of the pattern is replaced
▪ The replacement may contain references to subpatterns ($1 or \1)

preg_split($ptrn, $str [,$limit])
◦ Similar to the explode() function
◦ Splits a string into an array of strings
◦ The pattern is used to match delimiters
▪ Delimiters are not part of the result
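Both functions in a short sketch, not the lecture's own example. The input strings are invented.

```php
<?php
// Reorder a day. month. year date; $1, $2, $3 refer to the three subpatterns.
$iso = preg_replace('/(\d+)\.\s*(\d+)\.\s*(\d+)/', '$3-$2-$1', 'Dated 26. 2. 2015');
// 'Dated 2015-2-26'

// Split on commas or semicolons, swallowing the surrounding whitespace.
$items = preg_split('/\s*[,;]\s*/', 'eggs, oil ;flour');
// ['eggs', 'oil', 'flour']

// The optional limit caps the number of resulting pieces.
$firstTwo = preg_split('/\s*[,;]\s*/', 'eggs, oil ;flour', 2);
// ['eggs', 'oil ;flour']
```

Unlike explode(), the delimiter here is a pattern, so inconsistent separators can be handled in one call.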
Example 2

Differences (POSIX regex functions compared to the Perl-compatible ones)
◦ The expression is not enclosed by separators
▪ No modifiers can be added
◦ Only simple subpatterns
◦ Only a few escape sequences

Functions
◦ ereg(), ereg_replace(), split()
◦ Each function has a case-insensitive variant (eregi(), eregi_replace(), spliti())
▪ eregi() – case-insensitive version of ereg()
◦ Deprecated since PHP 5.3, removed in PHP 7.0
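Since the POSIX functions no longer exist in current PHP, old calls are usually rewritten with the preg_ family. The following is a rough migration sketch; the original POSIX calls appear only in comments, and the input string is invented.

```php
<?php
$input = 'Hello World';

// Was: ereg('^[A-Za-z ]+$', $input)
$valid = preg_match('/^[A-Za-z ]+$/', $input);        // 1

// Was: eregi('hello', $input)  (the i modifier replaces the -i variant)
$found = preg_match('/hello/i', $input);              // 1

// Was: ereg_replace('o', '0', $input)
$replaced = preg_replace('/o/', '0', $input);         // 'Hell0 W0rld'

// Was: split('[ ,]+', $input)
$parts = preg_split('/[ ,]+/', $input);               // ['Hello', 'World']
```

The main changes are the added separators around the pattern and the modifier that takes over the role of the case-insensitive function variants.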