A URL (Uniform Resource Locator)
In our day-to-day web life, we process thousands of URLs. A URL provides a way to access a resource on the web, the hypertext system that operates over the Internet. We save them, we share them with others, and sometimes we create them (yes, you heard me right). A URL contains the name of the protocol used to access the resource and a resource name, so it has two important parts.
The first part of a URL identifies which protocol to use, e.g. http or https. URL protocols include HTTP (Hypertext Transfer Protocol) and HTTPS (HTTP Secure) for web resources, "mailto" for email addresses, "ftp" for files on a File Transfer Protocol (FTP) server, and "telnet" for a session to access remote computers. For protocols such as HTTP, the protocol identifier and the resource name are separated by a colon and two forward slashes (mailto uses only the colon).
The second part identifies the IP address or domain name where the resource is located, e.g. rizdeveloperk.wordpress.com. The domain name sometimes also contains a subdomain, separated by a dot.
The resource name is the complete address to the resource. The format of the resource name depends entirely on the protocol used, but for many protocols, including HTTP, the resource name contains one or more of the following components:
- Host Name: The name of the machine on which the resource lives.
- Filename: The pathname to the file on the machine.
- Port Number: The port number to which to connect (typically optional).
- Reference: A reference to a named anchor within a resource that usually identifies a specific location within a file (typically optional).
For many protocols, the host name and the filename are required, while the port number and reference are optional. For example, the resource name for an HTTP URL must specify a server on the network (Host Name) and the path to the document on that machine (Filename); it can also specify a port number and a reference.
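To make the components concrete, here is a minimal sketch that splits a URL with Python's standard urllib.parse module. The URL itself is a made-up example, and the dictionary keys simply reuse the component names from the list above; they are not names defined by the library.

```python
from urllib.parse import urlsplit

# Hypothetical URL used only for illustration.
url = "http://example.com:8080/docs/index.html#section2"

parts = urlsplit(url)

# Map the library's attribute names onto the component names used in this article.
components = {
    "protocol": parts.scheme,     # the part before "://"
    "host": parts.hostname,       # machine on which the resource lives
    "port": parts.port,           # None when the URL does not specify a port
    "filename": parts.path,       # pathname to the file on the machine
    "reference": parts.fragment,  # named anchor after "#"
}

for name, value in components.items():
    print(f"{name}: {value}")
```

If the port or the reference is left out of the URL, parts.port is None and parts.fragment is an empty string, which matches the "typically optional" note above.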
A URL is the most common type of Uniform Resource Identifier (URI). URIs are strings of characters used to identify a resource over a network. A URL is mainly used to point to a webpage, a component of a webpage, or a program on a website. The resource name consists of:
- A domain name identifying a server or the web service; and
- A program name or a path to the file on the server.
Optionally, it can also specify:
- A network port to use in making the connection; or
- A specific reference point within a file — a named anchor in an HTML (Hypertext Markup Language) file.
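Since we sometimes create URLs ourselves, the sketch below goes in the other direction and assembles one from the required and optional parts just listed. The helper name build_url and the example host and paths are assumptions made for illustration; only urlunsplit comes from Python's standard library.

```python
from urllib.parse import urlunsplit

def build_url(scheme, host, path, port=None, fragment=""):
    """Assemble a URL from a required host and path plus an optional port and anchor."""
    # The port, when given, is attached to the host with a colon.
    netloc = host if port is None else f"{host}:{port}"
    # urlunsplit expects (scheme, netloc, path, query, fragment); no query is used here.
    return urlunsplit((scheme, netloc, path, "", fragment))

# Required parts only.
print(build_url("https", "example.com", "/docs/guide.html"))

# With the optional port and named anchor.
print(build_url("https", "example.com", "/docs/guide.html", port=8443, fragment="install"))
```

Letting the library join the pieces avoids hand-placing the colon, slashes, and "#" separators.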
The host name in a URL is resolved to an IP address through the Domain Name System (DNS). Behind that name there could be a single server or a cluster of servers.
URL with WWW and Non WWW
It really doesn't matter whether you use http://www.techomentous.com or techmomentous.com. There is no special advantage to either version, so the choice is yours. A website can live at example.com, with or without the www prefix, but having both versions active at the same time can cause problems. It's best for your site's visibility to live at just one URL, or web address, so pick one version and create a 301 redirect to it from the other.
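As a rough sketch of that redirect rule, the function below decides whether a requested URL should be answered with a 301 and where it should point. The preferred host, the function name redirect_for, and the example.com addresses are all assumptions for illustration; a real site would usually configure this in its web server rather than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.example.com"  # assumption: the www version was chosen
ALTERNATE_HOSTS = {"example.com"}   # the other version, which gets redirected

def redirect_for(url):
    """Return (301, new_url) if the URL uses the non-preferred host, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ALTERNATE_HOSTS:
        # Swap only the host; path and query string are carried over unchanged.
        # Note: replacing netloc this way also drops any explicit port.
        return 301, urlunsplit(parts._replace(netloc=PREFERRED_HOST))
    return None

print(redirect_for("http://example.com/about?x=1"))
print(redirect_for("http://www.example.com/about"))
```

Keeping the path and query intact means every page on the non-preferred host maps to the matching page on the preferred one, not just to the home page.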
History of URL
Uniform Resource Locators were defined in Request for Comments (RFC) 1738 in 1994 by Tim Berners-Lee, the inventor of the World Wide Web, and the URI working group of the Internet Engineering Task Force (IETF) as an outcome of collaboration started at the IETF Living Documents “Birds of a Feather” session in 1992.
The format combines the pre-existing system of domain names (created in 1985) with file path syntax, where slashes are used to separate directory and file names. Conventions already existed where server names could be prefixed to complete file paths, preceded by a double slash (//). Berners-Lee later expressed regret at the use of dots to separate the parts of the domain name within URIs, wishing he had used slashes throughout, and also said that, given the colon following the first component of a URI, the two slashes before the domain name were unnecessary.
URL & Normalization
URL normalization (or URL canonicalization) is the process of picking the best URL from the available choices. It's done so that a page is represented by one standard URL rather than by many. Crawlers perform URL normalization to determine whether two syntactically different URLs are equivalent.
The ultimate aim of URL normalization is to reduce redundant web crawling by maintaining a set of URLs that each point to a unique web page, and to help search engines return better, non-duplicated results. Search engines use URL normalization to determine the importance of web pages and to avoid indexing the same page more than once. URL normalization also refers to the process of identifying similar and equivalent URLs. Equivalent URLs point to the same resource that the web user is interested in.
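Here is a minimal sketch of how a crawler might test two URLs for equivalence by normalizing both and comparing the results. The specific rules (lowercase the scheme and host, drop the default port, resolve "." and ".." path segments, discard the anchor) are a common subset chosen for illustration, not the rule set of any particular search engine. The sketch assumes plain http or https URLs without user names or IPv6 addresses.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(url):
    """Return a normalized form of a simple http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Resolve "." and ".." segments; an empty path becomes "/".
    # Side effect: normpath also collapses repeated slashes inside the path.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and path != "/":
        path += "/"  # normpath strips a trailing slash, so put it back
    # The fragment is dropped because it is not sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))

def equivalent(url_a, url_b):
    """Two URLs are treated as equivalent when their normalized forms match."""
    return normalize(url_a) == normalize(url_b)

print(normalize("HTTP://Example.COM:80/a/./b/../c#top"))
print(equivalent("HTTP://Example.COM:80/a/./b/../c#top", "http://example.com/a/c"))
```

Rules like these are purely syntactic. They cannot tell that two different paths return the same page, which is where redirects and a declared preferred URL come in.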
URL normalization is an important activity in web mining. An effective URL normalization technique lets web data be retrieved more smoothly and saves a lot of computation in web mining activities. Web page redirection and forward graphs can be used to measure the similarity between URLs and to build URL clusters, which can in turn be used for URL normalization.
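One simple way to picture redirect-based clustering is to follow each URL's redirects to the end and group URLs that land on the same target. The sketch below does that over a small, invented redirect table; the URLs, the function names, and the hop limit are all assumptions, and real web mining systems use considerably richer similarity measures.

```python
from collections import defaultdict

# Hypothetical redirect graph: each key redirects to its value.
REDIRECTS = {
    "http://example.com/": "http://www.example.com/",
    "http://example.com/index.html": "http://www.example.com/",
    "http://www.example.com/home": "http://www.example.com/",
    "http://example.com/old-blog": "http://www.example.com/blog",
}

def final_target(url, redirects, max_hops=10):
    """Follow redirects until reaching a URL with no outgoing redirect."""
    seen = set()
    # Stop on a loop or after max_hops to avoid following redirects forever.
    while url in redirects and url not in seen and len(seen) < max_hops:
        seen.add(url)
        url = redirects[url]
    return url

def cluster_by_target(urls, redirects):
    """Group URLs by the final URL their redirects lead to."""
    clusters = defaultdict(set)
    for url in urls:
        clusters[final_target(url, redirects)].add(url)
    return dict(clusters)

all_urls = set(REDIRECTS) | set(REDIRECTS.values())
clusters = cluster_by_target(all_urls, REDIRECTS)

for target, members in sorted(clusters.items()):
    print(target, "<-", sorted(members))
```

Each cluster's key is a natural candidate for the normalized URL of every member in that cluster.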
"Canonical URL" is not strictly a correct term. Canonicalization, as mentioned above, is the process of picking the best URL from the available ones. Several URLs pointing to the same page are often called canonical URLs, but that usage is not really valid. A few examples of such URLs:
- example.com/index.html (if HTML)
- example.com/index.jsp (if Java)
- example.com/index.php (if PHP)
- example.com/home.asp (if IIS)
On most websites the URLs above display the same content, but technically they are all different URLs, and a web server could return completely different content for each of them. A search engine will consider only one of them to be the canonical form. So you need to choose a preferred one and set up a 301 redirect from the other versions to it, in order to prevent duplicate content and help your search ranking.
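The rule can be sketched for the index-page variants listed above. This example assumes the preferred form is the bare directory URL (for example, example.com/) and that the four file names all serve that directory's default page. Both are assumptions, as is the function name canonical_redirect.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these file names all serve the default page of their directory.
DEFAULT_DOCUMENTS = {"index.html", "index.jsp", "index.php", "home.asp"}

def canonical_redirect(url):
    """Return (301, preferred_url) for a default-document URL, else None."""
    parts = urlsplit(url)
    # Split the path into its directory and its last segment (the file name).
    directory, _, filename = parts.path.rpartition("/")
    if filename.lower() in DEFAULT_DOCUMENTS:
        # Redirect to the directory itself; the query string is kept.
        return 301, urlunsplit(parts._replace(path=directory + "/"))
    return None

print(canonical_redirect("http://example.com/index.php"))
print(canonical_redirect("http://example.com/shop/index.html?x=1"))
print(canonical_redirect("http://example.com/"))
```

The preferred URL itself returns None, so it is served directly and never redirects to itself.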