Wednesday, 15 May 2019

CLEAN URL REWRITES USING NGINX

This article will cover how to easily implement Clean URLs (also known as Semantic URLs, RESTful URLs, User-Friendly URLs and Search Engine-Friendly URLs) using NGINX Web Server, currently the second most popular web server platform worldwide.
Not using NGINX web server? No problem, Apache users can check out my article Clean URL Rewrites Using Apache.
A Clean URL is a URL that does not contain query strings or parameters. This makes the URL easier to read and more understandable to users. 

Clean URLs are a high ranking factor for many search engines, but there are many other reasons why having Clean URLs can be important to your website. Check out our article Are your URLs Letting Your Website Down for more information.

THE PROBLEM

Take a look at this URL used by a dynamically generated web page:

http://www.exampleshop.co.uk/products.php?id=7

The example shows a common URL structure that you may often see output from a PHP-driven CMS (Content Management System). The URL links to a dynamic PHP web page called products.php and passes it the parameter ‘id’ with a value of 7. A dynamic web page is a web page that is constructed using server-side scripting, for example PHP. Server-side scripting allows you to use URL parameters to determine the assembly of a dynamic web page, as shown in the example ‘id=7’.

Functionally this URL is perfect; it links to the dynamic products page, and passes the variable ‘id’ with the value of 7 to the webserver to dynamically generate the content for the product associated with ID value of 7. 

This is great, but the URL looks a bit archaic and is difficult to remember. It contains no relevant keywords for search engines, and it doesn’t describe the content of the webpage very clearly.

The webpage in the example is actually a shop page for buying organic apples, but there is no way of telling this from the current URL. 

Here is an example of a more ideal URL:

http://www.exampleshop.co.uk/buy/organic-fruit/apples

Our example now contains multiple keywords for search crawlers and a clear description of the contents of the webpage. 

But how is the webserver supposed to handle this Clean URL? There is no reference to the dynamic page ‘products.php’ or the variable ‘id’ of the product for the webserver to pull the correct dynamic content from.

THE SOLUTION

The easiest solution is to use a rewrite engine to modify the appearance of the URL, most commonly known as URL Rewriting or URL Manipulation. 
Fortunately, this solution only requires minimal restructuring or renaming of folders, dynamic files or variables. URL Rewriting is very flexible and fast to implement.

WHAT IS URL REWRITING?

URL Rewriting is the technique used for URL mapping or routing in a web application. By using URL Rewriting we can provide the information our webserver requires to interpret our Clean URL in our previous example.

NGINX WEB SERVER

Released in 2004, NGINX (pronounced ‘engine x’) has gained notable popularity in the past 12 years. NGINX became the second most used web server software worldwide in 2012, overtaking Microsoft IIS Web Server.
NGINX is available for Windows and many UNIX server operating system flavours.

INITIATING THE REWRITE ENGINE

NGINX is built with the module ngx_http_rewrite_module by default, so there should be nothing to enable unless your build was compiled without the module.

To start creating rewrite rules in NGINX all we need to do is locate the NGINX configuration file that we will be adding our rules to.
  • Locate and open the NGINX configuration file nginx.conf with a text editor (such as Sublime Text). The location of the file depends on your installation type; the usual defaults are /usr/local/nginx/conf, /opt/nginx/conf, /etc/nginx, or /usr/local/etc/nginx.
  • The configuration file will contain multiple code blocks. First navigate to the http block.
    
    http {
    }
    
    
  • We are looking for the server block nested inside the http block.
    
    http {
        server {
        }
    }
    
    
    There may already be multiple server blocks, distinguished by the port the block listens to or server name. 
    For now create a new empty server block, unless you already know which existing server block you would like to use for your rewrite rules.
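As a rough illustration of the step above, a new server block might start out like the sketch below. The port, host name and document root are placeholders chosen for this example (the host name is borrowed from the article's sample URLs), and the fragment leaves out the other parts of a complete nginx.conf, such as the events block.

```nginx
# Minimal sketch of a new server block; every value below is a placeholder.
http {
    server {
        listen      80;                      # port this block answers on
        server_name www.exampleshop.co.uk;   # host name from the article's examples
        root        /var/www/exampleshop;    # hypothetical document root

        # location blocks and rewrite rules will be added here
    }
}
```

After editing the file, running nginx -t checks the configuration for syntax errors, and nginx -s reload applies it without stopping the server.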

BASIC URL REWRITING

To create a URL Rewrite we first need to create a location block inside the server block. The location block’s pattern is tested against the path of the URL requested (the part after the host name, without the query string).

http {
    server {
        location / {

        }
    }
}

We have for example the following URL:

http://www.example.co.uk/1212aJlmo.html

But we want our users to instead be able to access this page via this URL:

http://www.example.co.uk/photoshop-tutorials

We can create a basic location block to accomplish this rewrite:

http {
    server {
        location = /photoshop-tutorials {
            rewrite ^/photoshop-tutorials?$ /1212aJlmo.html break;
        }
    }
}

Now let’s break this location block down and take a closer look at how it works.
  • location = /photoshop-tutorials – This is the matching rule for our location block. We have used an = sign in our syntax, which means that the location block will only match a URL whose path is exactly the one specified. The location block will match with the following URL:
    
    http://www.example.co.uk/photoshop-tutorials
    
    
    but not with:
    
    http://www.example.co.uk/photoshop-tutorials/test.html
    
    
    or
    
    http://www.example.co.uk/photoshop-tutorials2
    
    
    If we remove the = sign, the location block becomes a prefix match: it will match all of the above URLs, or any URL whose path begins with “/photoshop-tutorials”. You can also use location ~ for matching with case-sensitive regular expressions or location ~* for case-insensitive regular expressions. We will be using these types of location block later in the article.
  • rewrite – Tells NGINX that the following refers to one single Rewrite Rule.
  • ^/photoshop-tutorials?$ - The ‘pattern’ that the webserver will look for in the URL path. If it is found, the webserver will swap the matched path for the following substitution.
  • /1212aJlmo.html – The ‘substitution’. The webserver will swap the path for the substitution if the pattern is found.
  • ^, ? and $ - These are Regular Expression (also known as Rational Expression) characters. A regular expression is a sequence of characters that defines a search pattern, mainly used in pattern matching and string matching. The pattern is treated as a regular expression by default. In our example pattern we are using three regular expression characters:
    • ^ represents the beginning of a string.
    • $ represents the end of a string.
    • ? makes the item before it optional (zero or one occurrence). In our example it follows the letter s, so the pattern matches both /photoshop-tutorials and /photoshop-tutorial. Because our location block uses an exact match, only the first of these ever reaches the rule. Placed after another quantifier, as in +? or *?, the same character acts as the ‘non-greedy’ modifier instead.
  • break – This is known as a flag. Flags are added to the end of the rewrite rule and tell NGINX what to do once the rule has been applied. In this example the break flag tells NGINX to stop processing any further rewrite directives and to carry on handling the request inside the current location block, rather than searching for a new one. There are more flags to choose from beyond this example; we will look at them in more detail later in the article.
Consequently when a user now inputs the URL:

    http://www.example.co.uk/photoshop-tutorials

NGINX Web Server will serve the following page, without the user noticing any difference:

http://www.example.co.uk/1212aJlmo.html
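To make the difference between the location types mentioned in the breakdown above easier to see, here is a small sketch that answers with a different text for each type. The port and the response texts are made up for this example; return is used instead of rewrite purely so that the result is visible in a browser or with curl.

```nginx
server {
    listen 8080;   # hypothetical test port

    # Exact match: only /photoshop-tutorials
    location = /photoshop-tutorials {
        return 200 "exact match";
    }

    # Case-insensitive regular expression:
    # e.g. /photoshop-tutorials/test.html or /Photoshop-Tutorials/Test.html
    location ~* ^/photoshop-tutorials/[a-z0-9-]+\.html$ {
        return 200 "regex match";
    }

    # Prefix match: any other path starting with /photoshop-tutorials,
    # e.g. /photoshop-tutorials2
    location /photoshop-tutorials {
        return 200 "prefix match";
    }
}
```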

We have now grasped the basic technique of rewriting a single URL to a different URL. 

However, our dynamic page has several variables, and using this technique means a lot of work duplicating rewrite rules for every variable. 

In the next section we will cover more advanced patterns that can solve this.

DYNAMIC REWRITING USING BACK REFERENCES

If we go back to our original problem:

http://www.exampleshop.co.uk/products.php?id=7

We have the variable ‘id=7’ in this URL; but overall we have 150 products, each with a different ID. 

We want the URLs to look like the following example:

http://www.exampleshop.co.uk/product/1/
http://www.exampleshop.co.uk/product/2/
http://www.exampleshop.co.uk/product/3/
http://www.exampleshop.co.uk/product/4/
http://www.exampleshop.co.uk/product/5/
http://www.exampleshop.co.uk/product/6/
http://www.exampleshop.co.uk/product/7/
# etc..

It would take a long time to write individual rewrite rules for all of the possible URLs. 

By using the following ‘location prefix’, ‘pattern’ and ‘substitution’ we can save time and also avoid pages of duplicate code:

location /product/ {
    rewrite ^/product/([0-9]+)/?$ /products.php?id=$1 break;
}

Now let’s break this location block down and take a look at how it works:
  • location /product/ - The location block will match with any URL that begins with /product/.
  • ([0-9]+) – There are two key points to note on this part of the pattern.
    • Take a look at the contents of our brackets [0-9]+. This is a regular expression, in a regular expression the square brackets [] mean match any of the contents. For example: if we were to use [1A] the regular expression would match for the characters 1 and A. 
      Here the square brackets [] contain a range of characters: 0-9 which indicates all digits between and including 0 and 9. The + symbol is a regular expression special character that has the special meaning of “match one or more of the preceding”. In the example pattern we have placed the + after our range [0-9] to detect one or more characters within our range; without the + the pattern will only match with one digit in our range, for example: 1 or 5 but not 11 or 15.
    • The parentheses () in a regular expression create a capture group, which can be referred to later through a backreference. The $1 in our substitution is the backreference to this group. For example, if the following URL was input:
      
      http://www.exampleshop.co.uk/product/127
      
      
      127 would be matched to our range in the backreference, resulting in the following substitution:
      
      http://www.exampleshop.co.uk/products.php?id=127
      
      
      There can be multiple backreferences in a pattern, for example:
      
      rewrite ^/product/([0-9]+)/([0-9]+)/?$ /products.php?id=$1&cost=$2 break;
      
      
      Backreferences are numbered in the order they appear. In the example above there are two backreference groups in the pattern, the first group linking to $1 in the substitution and the second group linking to $2. If we were to add a third group it would be numbered $3 and so on.
  • $1 – Is located in our substitution and links to our first backreference, which is located in the parentheses in our pattern: ([0-9]+).
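One detail worth knowing when a substitution contains its own parameters: NGINX appends any query string from the original request after them, unless the substitution ends with a question mark. The sketch below shows both behaviours side by side. The /item/ URL scheme and the ref parameter are invented for the comparison and are not part of the example shop.

```nginx
location /product/ {
    # /product/7/?ref=mail  ->  /products.php?id=7&ref=mail
    # (the original arguments are appended after the new ones)
    rewrite ^/product/([0-9]+)/?$ /products.php?id=$1 break;
}

location /item/ {   # hypothetical second URL scheme, for comparison only
    # /item/7/?ref=mail  ->  /products.php?id=7
    # (the trailing ? stops the original arguments being appended)
    rewrite ^/item/([0-9]+)/?$ /products.php?id=$1? break;
}
```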

SOLVING THE PROBLEM USING REGULAR EXPRESSIONS

The scope of what you can do with regular expressions is so large that it really deserves its own article. For now, we will only focus on the regular expression we require for the following problem:

http://www.exampleshop.co.uk/products.php?id=7

to

http://www.exampleshop.co.uk/buy/organic-fruit/apples

We could successfully use the basic rewrite technique from the start of the article to create one single rule to rewrite this URL. However, in this situation we will assume, as in the previous section, that there are multiple products. We do not want to create individual rewrite rules for every product as this would be very time consuming.
The problem here, which can’t easily be solved with NGINX rewrite rules alone, is finding the name of the product related to ‘id=7’ in our database.
On Apache this kind of lookup can be done with a PRG: External Rewriting Program, but the NGINX rewrite module has no equivalent, and I highly recommend against the technique in any case, as it can cause countless problems such as buffering issues, random results being returned and many more undesirables.
The problem will need to be resolved through the database and backend code. My recommended solution would be to add a new column to your database table labelled something similar to ‘product_name’ which will contain the name of the product.

Your URL would then become, for example:

http://www.exampleshop.co.uk/products.php?product_name=apples

The work required to make this change is minimal. Even if your website’s backend code relies heavily on the ‘id’ variable from the URL, you can still obtain this variable easily by querying the database at the start of your code to assign a variable for the ID where ‘product_name’ is equal to ‘apples’. 

In the previous section we created a dynamic rewrite rule using back references and a regular expression in our pattern to detect multiple digits. We are now going to create a similar rule using back references with a different regular expression pattern for use with our new ‘product_name’ variable.

location /buy/organic-fruit/ {
    rewrite ^/buy/organic-fruit/([a-z]+)/?$ /products.php?product_name=$1 break;
}

Previously our range was [0-9]. Because our new variable product_name contains letters instead of numerals, we have changed the range to [a-z] to detect all lower case letters between a and z.

What if we want our range to also detect numerals, uppercase letters and even hyphens for product names with a hyphen separator such as ‘apple-juice’? 

You can simply add more characters to the range like so: [a-zA-Z0-9-]. Notice how the hyphen has been added to the end of the range. It is placed at the end so that it is treated literally rather than as a range separator.
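Put together, the wider range could be used as in the following sketch. The product names in the comments are only illustrations of what the pattern would and would not accept.

```nginx
location /buy/organic-fruit/ {
    # Matches, for example:
    #   /buy/organic-fruit/apples
    #   /buy/organic-fruit/apple-juice
    #   /buy/organic-fruit/Cox-Apples-2kg/
    # Does not match names containing other characters, such as
    #   /buy/organic-fruit/apple_juice
    rewrite ^/buy/organic-fruit/([a-zA-Z0-9-]+)/?$ /products.php?product_name=$1 break;
}
```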

SPECIAL CHARACTERS TO BE AWARE OF WHEN USING REGULAR EXPRESSIONS

There are certain characters known as Special Characters that we need to be aware of when using regular expressions as they have special meanings. 

A good example of a regular expression special character is the period ‘.’ character.

It is quite common for a period to be used in a pattern that includes a file, for example:

rewrite ^/index.html/?$ /index.php break;

This rewrite rule will work in rewriting index.html to index.php. 
The problem is that it will also rewrite index^html, indexshtml and index1html to index.php.

This occurs because the period is a special character in a regular expression with the special meaning: “Any character”. Therefore, the rewrite rule will work with any character that substitutes the position of the period. 

To use a period as a Literal Character (i.e. without its special meaning) in a regular expression you need to ‘escape’ the period with a preceding backslash: ‘\.’.
For example:

rewrite ^/index\.html/?$ /index.php break;

Note that we do not need to escape the period used in the substitution, as it is only the pattern that is treated as a regular expression in the rewrite rule.
Regular Expression Special Characters include:
  • * (zero or more of the preceding)
  • . (any character)
  • + (one or more of the preceding)
  • {} (minimum to maximum quantifier)
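  • ? (zero or one of the preceding; non-greedy modifier when placed after another quantifier)
  • ! (not special on its own, but used in negative lookaheads such as (?!...); NGINX also uses !~ in if conditions to mean ‘does not match’)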
  • ^ (start of a string, or ‘negative’ if used at the start of a range)
  • $ (end of a string)
  • [] (match any of contents)
  • - (range when used between square brackets)
  • () (backreference group)
  • | (or)
  • \ (the escape character)
The escape character (backslash) itself also needs to be escaped when used as a literal character. Depending on the programming language and parser you may need to use four backslashes instead of two ‘\\\\’, this is because the programming language may also be using a backslash as an escape character.
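Two of the characters in the list need extra care in an NGINX configuration file: curly braces and semicolons are part of the configuration syntax itself, so a pattern that contains them has to be wrapped in quotes. The sketch below uses the {} quantifier to limit a product ID to between one and four digits; the limit itself is an arbitrary choice for the example.

```nginx
location /product/ {
    # {1,4} means "between one and four of the preceding".
    # Because the pattern contains braces, it is wrapped in double quotes.
    rewrite "^/product/([0-9]{1,4})/?$" /products.php?id=$1 break;
}
```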

FLAGS

Throughout this article we have used the break flag in our examples to tell NGINX that no further rewrite rules should be applied once the current rule has been used.

However, this is not the only flag at our disposal when creating rewrite rules. The following flags can be used in your rules to tell NGINX what to do next:
  • last (stops processing the current set of directives and starts a new search for a matching location with the changed URL)
  • break (stops processing the current set of rewrite directives and continues handling the request in the current location)
  • redirect (returns a temporary redirect with code 302)
  • permanent (returns a permanent redirect with code 301)
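The difference between last and break matters most when the rewritten URL has to be handled by another location block, which is commonly the case for PHP pages. The following is a sketch only: it assumes PHP is served by PHP-FPM listening on 127.0.0.1:9000, and the document root and the old /item/ URL scheme are made up for the example.

```nginx
server {
    listen 80;
    root   /var/www/exampleshop;             # hypothetical document root

    location /product/ {
        # last: restart location matching with the rewritten URL,
        # so /products.php is picked up by the PHP location below
        rewrite ^/product/([0-9]+)/?$ /products.php?id=$1 last;
    }

    location /item/ {                        # hypothetical legacy URL scheme
        # permanent: send the browser to the new address with a 301
        rewrite ^/item/([0-9]+)/?$ /product/$1/ permanent;
    }

    location ~ \.php$ {
        include       fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass  127.0.0.1:9000;        # assumed PHP-FPM address
    }
}
```

With break in place of last, the request for /products.php would stay inside the /product/ location and never reach the PHP block.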

ORDER OF PRECEDENCE WITH LOCATION BLOCKS

The order in which location blocks are processed is not as obvious as it may seem. Location blocks are not processed in cascading order; instead, the type of location block determines the order of precedence. The order of precedence is as follows:
  1. Exact location blocks (location =)
  2. Regex location blocks (location ~ or location ~*), tested in the order they appear in the file
  3. Standard prefix location blocks (location), where the longest matching prefix wins
This is useful to know as you can create advanced rewrite rules with fall-back options by knowing the order of precedence. 

For example:

location = /product/checkout {
    rewrite ^/product/checkout/?$ /checkout.php break;
}

location ~ ^/product/[a-z]+/?$ {
    rewrite ^/product/([a-z]+)/?$ /products.php?product_name=$1 break;
}

location /product/ {
    rewrite ^/product/?$ /products.php break;
}

If we were to input the URL:

http://www.exampleshop.co.uk/product/checkout

All 3 of the above location blocks would match, but because we have specified an exact location block, this block is chosen and the other two are never consulted. The break flag then keeps the rewritten request inside this block instead of starting a new location search.
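There is one more location type that changes this order: a prefix location written with ^~. If it is the longest matching prefix for a request, NGINX skips the regex locations altogether. The sketch below uses a made-up /product/images/ folder and document root to show the effect.

```nginx
# ^~ : if this is the longest matching prefix, regex locations are skipped.
# Without it, a request for /product/images/ would be claimed by the
# regex block below and rewritten to products.php.
location ^~ /product/images/ {
    root /var/www/exampleshop;     # hypothetical document root
}

location ~ ^/product/[a-z]+/?$ {
    rewrite ^/product/([a-z]+)/?$ /products.php?product_name=$1 break;
}
```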

SUMMARY

If you also use Apache Web Server, make sure to check out my article on how to create Clean URL Rewrites Using Apache. 

If you require any help creating a specific rewrite rule or require any further information, please feel free to post your question in the comments.
