[UD] HTTP Programming Recipes for Java Bots
Jeff Heaton | 2010 | English | 680 pages | PDF | 47.86 Mb
Description:
The Hypertext Transfer Protocol (HTTP) allows information to be exchanged between a web server and a web browser. Java allows you to program HTTP directly. HTTP programming allows you to create programs that access the web much like a human user would. These programs, which are called bots, can collect information or automate common web programming tasks. This book presents a collection of very reusable recipes for Java bot programming.
This book also introduces the Heaton Research Spider, an open source spider framework. Using the Heaton Research Spider you can create spiders that will crawl a web site, much like a real spider crawls its web. The Heaton Research Spider is available in both Java and Microsoft .NET versions.
Chapter 1 of this book begins by examining the structure of HTTP requests. To create programs that make use of HTTP, it is important to understand the structure of the protocol. This chapter explains what packets are exchanged between web servers and web browsers, as well as the makeup of these packets.
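As a rough illustration of the request structure described for Chapter 1, the sketch below assembles the text of a minimal HTTP/1.1 GET request: a request line, header lines, and a terminating blank line. The class name RawRequestSketch, the method buildGet, and the host www.example.com are placeholders chosen for this example and are not taken from the book.

```java
final class RawRequestSketch {
    // Builds the text a client would send for a minimal HTTP/1.1 GET.
    // Each line ends with CRLF; an empty line marks the end of the headers.
    static String buildGet(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"   // request line: method, path, version
             + "Host: " + host + "\r\n"          // Host header is required in HTTP/1.1
             + "Connection: close\r\n"           // ask the server to close after replying
             + "\r\n";                           // blank line: end of headers, no body
    }

    public static void main(String[] args) {
        // Hypothetical host and path, used only to show the layout.
        System.out.print(buildGet("www.example.com", "/index.html"));
    }
}
```

A bot that writes this text to a socket on port 80 would be speaking the same protocol a browser does, which is why knowing the layout matters.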
Chapter 2 shows how to monitor the packets being transferred between a web server and web browser. Using a program, called a Network Analyzer, you can quickly see what HTTP packets are being exchanged. To create a successful bot, your bot must exchange the same packets with the web server that a user would. A Network Analyzer can help quickly create a bot by showing you
From Chapter 3 onward, this book is structured as a set of recipes. You are provided with short, concise programming examples for many common HTTP programming tasks. Most of the chapters are organized into two parts. The first part introduces the topic of the chapter. The second part is a collection of recipes. These recipes are meant to be starting points for your own programs that require similar functionality.
Chapter 3 shows how to execute simple HTTP requests. A simple HTTP request is one that accesses only a single web page. All data that is needed will be on that page and no additional information must be passed to the web server.
Chapter 4 goes beyond simple requests and shows how to make use of other features of the HTTP protocol. HTTP server and client headers are introduced. Additionally, you will be shown how to access data from basic HTML files.
Chapter 5 shows how to use HTTPS. HTTPS is the more secure version of HTTP. Use of HTTPS is generally automatic in Java. However, you will be shown some of the HTTPS specific features that Java provides, and how to use them. You will also be introduced to HTTP authentication, which is a means by which the web server can prompt the user for an id and password.
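One common form of HTTP authentication is the Basic scheme, in which the client sends the id and password as a Base64-encoded "id:password" pair in an Authorization header. The sketch below shows only that encoding step. It is an assumption that this is the scheme the chapter covers, the class and method names are made up for the example, and java.util.Base64 requires Java 8 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class BasicAuthSketch {
    // Returns the value for an "Authorization" header under the Basic scheme.
    static String basicAuthHeader(String user, String password) {
        String pair = user + ":" + password;
        byte[] bytes = pair.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        // Sample credentials from the HTTP Basic authentication specification.
        System.out.println(basicAuthHeader("Aladdin", "open sesame"));
        // prints "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
    }
}
```

Because Base64 is an encoding and not encryption, this header is only safe to send over HTTPS.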
Chapter 6 shows how to access data from a variety of HTML sources. An HTML parser is developed that will be used with most of the remaining recipes in this book. You are shown how to use this parser to extract data from forms, lists, tables and other structures. Recipes are provided that will serve as a good starting point for any of these HTML constructs.
Chapter 7 shows how to interact with HTML forms. HTML forms are very important to web sites that need to interact with the user. This chapter will show how to construct the appropriate response to an HTML form. You are shown how each of the control types of the form interacts with the web server.
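To give a sense of what "constructing the appropriate response to a form" can involve, the sketch below encodes a set of form fields the way a browser does for an application/x-www-form-urlencoded submission. The class FormEncodeSketch and the field names are hypothetical, and this is not the book's own recipe.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class FormEncodeSketch {
    // URL-encodes one name or value using UTF-8.
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Joins the fields as name=value pairs separated by '&'.
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("q", "java bots");   // hypothetical text field
        fields.put("lang", "en&fr");    // value containing a reserved character
        System.out.println(encode(fields));
    }
}
```

The resulting string can be appended to a URL after "?" for a GET form or written as the request body for a POST form.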
Chapter 8 shows how to handle cookies and sessions. You will see that the web server can track who is logged on and maintain a session using either cookies or a URL variable. A useful class will be developed that will handle cookie processing in Java.
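A minimal sketch of cookie handling, under the assumption that a bot only needs to remember name=value pairs and send them back: it reads the first segment of each Set-Cookie header value and rebuilds a Cookie request header. The class CookieJarSketch is invented for this example and is not the class developed in the book. Attributes such as Path, Domain, and expiry are deliberately ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CookieJarSketch {
    private final Map<String, String> cookies = new LinkedHashMap<>();

    // Stores the name=value pair from one Set-Cookie header value.
    // Everything after the first ';' (Path, Expires, ...) is ignored here.
    void store(String setCookieValue) {
        String pair = setCookieValue.split(";", 2)[0].trim();
        int eq = pair.indexOf('=');
        if (eq > 0) {
            cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
    }

    // Builds the value of the Cookie header to send on the next request.
    String cookieHeader() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CookieJarSketch jar = new CookieJarSketch();
        jar.store("SESSIONID=abc123; Path=/; HttpOnly"); // hypothetical session cookie
        jar.store("theme=dark");
        System.out.println("Cookie: " + jar.cookieHeader());
    }
}
```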
Chapter 11 introduces web services. Web services have replaced many of the functions previously performed by bots. Sites that make use of web services provide access to their data through XML. This makes it considerably easier to access their data than writing a traditional bot. Additionally, you can use web services in conjunction with regular bot programming. This produces a hybrid bot.
Chapter 12 shows how to create bots that make use of RSS feeds. RSS is an XML format that allows quick access to the newest content on a web site. Bots can be constructed to automatically access RSS information from a web site.
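Since RSS is XML, a bot can read a feed with the standard Java XML parser. The sketch below pulls the title of each item out of an RSS 2.0 document that has already been downloaded into a string. The class name RssTitlesSketch and the sample feed are made up for illustration, and the sketch assumes the usual rss/channel/item/title layout.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class RssTitlesSketch {
    // Returns the <title> text of every <item> element in the feed.
    static List<String> itemTitles(String rssXml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("item");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList title = item.getElementsByTagName("title");
                if (title.getLength() > 0) {
                    titles.add(title.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse RSS", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical feed content; a real bot would download this first.
        String feed = "<rss version=\"2.0\"><channel><title>Feed</title>"
            + "<item><title>First</title></item>"
            + "<item><title>Second</title></item>"
            + "</channel></rss>";
        System.out.println(itemTitles(feed));
    }
}
```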
Chapter 13 introduces the Heaton Research Spider. The Heaton Research Spider is an open source implementation of a Java spider. There is also a C# version of the Heaton Research Spider. A spider is a program that is designed to access a large number of web pages. The spider does this by continuously visiting the links of web pages, and then pages found at those links. A web spider visits sites much as a biological spider crawls its web.
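The link-following behavior described above amounts to a breadth-first traversal with a record of visited URLs. The sketch below shows that idea over an in-memory map of page names to links, so it runs without any network access. It is not the Heaton Research Spider API; the class CrawlOrderSketch and the page names are placeholders.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CrawlOrderSketch {
    // Visits pages breadth-first, starting at 'start', never queuing a page twice.
    // 'linksByPage' stands in for "download the page and extract its links".
    static List<String> crawl(Map<String, List<String>> linksByPage, String start) {
        Set<String> seen = new LinkedHashSet<>();   // remembers discovery order
        Deque<String> queue = new ArrayDeque<>();   // pages waiting to be visited
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            List<String> links =
                linksByPage.getOrDefault(page, Collections.<String>emptyList());
            for (String link : links) {
                if (seen.add(link)) {               // true only for new URLs
                    queue.add(link);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = new HashMap<>();
        site.put("a", Arrays.asList("b", "c"));
        site.put("b", Arrays.asList("c", "a"));     // cycle back to "a"
        site.put("c", Arrays.asList("d"));
        System.out.println(crawl(site, "a"));       // prints "[a, b, c, d]"
    }
}
```

The visited set is what keeps a spider from looping forever on sites whose pages link back to each other.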
The remaining chapters of this book do not include recipes. Chapters 14 and 15 explain how the Heaton Research Spider works. Chapter 16 explains how to create well behaved bots.
Chapter 14 explains the internal structure of the Heaton Research Spider. Because the Heaton Research Spider is open source, you can modify it to suit your needs. By default the Heaton Research Spider uses computer memory to track the list of visited URLs. This chapter explains how this memory based URL tracking works. The next chapter explains how to use an SQL database instead of computer memory.
Chapter 15 explains how the Heaton Research Spider makes use of databases. The Heaton Research Spider can use databases to track the URLs that it has visited. This allows the spider to access a much larger volume of URLs than when using computer memory to track the URL list.
The book ends with Chapter 16 which discusses how to create "Well Behaved Bots". Bots are not welcome on all web sites. Some web sites publish files that outline how bots are to access their site. It is very important to respect the wishes of the web master when creating a bot.
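The files that sites publish for bots are commonly named robots.txt; it is an assumption that this is the file the chapter has in mind. The sketch below is a deliberately simplified reader: it collects the Disallow prefixes that follow a "User-agent: *" line and checks a path against them. It ignores Allow lines, wildcards, and groups that list several user agents, and the class RobotsSketch is invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class RobotsSketch {
    // Collects Disallow path prefixes that apply to all bots ("User-agent: *").
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean applies = false;                      // inside a "User-agent: *" group?
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#');             // strip trailing comments
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                             // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                applies = value.equals("*");
            } else if (field.equals("disallow") && applies && !value.isEmpty()) {
                prefixes.add(value);
            }
        }
        return prefixes;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    static boolean isAllowed(String path, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical robots.txt content.
        String robots = "User-agent: *\nDisallow: /private/\n";
        List<String> rules = disallowedPrefixes(robots);
        System.out.println(isAllowed("/index.html", rules));
        System.out.println(isAllowed("/private/a.html", rules));
    }
}
```

A well behaved bot would run a check like this before requesting each URL.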